Abstract

In the domain of computer science, particularly VLSI CAD, an increasing amount of
engineering time is spent running compute-bound programs. Many of these programs have 
an intrinsic parallelism that is externally accessible. This thesis describes a novel software 
passing system is developed and discussed. The system is applied to design rule checking
by executing each rule on a separate processor, and the results are analyzed. 
Preface



The text of this thesis was formatted using LaTeX. The more complex figures and
drawings were generated automatically by the software described in this thesis, using a 
graphical description language called Postscript 1 . The two forms of printed media were 
merged together electronically (rather than photographically or with scissors and glue). 
The capability of automatically merging text with graphics in this manner has made prac- 
tical the inclusion of a flipbook animation. 

The animation is an attempt to show how the behavior of the multiprocessing task 
scheduling algorithm changes according to the number of processors. The parallel execution 
is simulated under the assumptions of zero communications overhead and unit execution 
time for each task. The nth "frame" in the animation is a graphical representation of the 
execution using n processors. The graph is composed of diamonds, which represent atomic 
units of computation called tasks. Horizontally adjacent diamonds represent tasks that 
are executed in parallel on different processors. The vertical axis represents time, with the 
beginning of the computation at the top of the page. 

Initially, each frame occupied a single page, and the effect of flipping through the 
100 pages was aesthetically pleasing. Unfortunately, in terms of the thesis, it was not 
justifiable as a 100 page appendix. So each frame was reduced so it would occupy the 
top and right margins of an existing page. The first frame of the animation is visible at 
the right edge of this page. It shows how when using one processor, no more than one 
task can be processed at any given time, so they are all executed one after the other. The 
next page shows the simulation using two processors. The meaning of the animation will 
become clearer after reading Chapters 3 and 4, but it was necessary to include a word of
explanation here, since this is where the animation begins.

1 Postscript is a trademark of Adobe Systems
Chapter 1



Introduction 



As the complexity of VLSI circuits increases, so does the running time of the CAD
tools we use to build those circuits. At the current state of VLSI technology and CAD tool
performance, tasks such as layout verification, simulation, and mask-making have proven
to be expensive bottlenecks in the VLSI design process. If the advances in the complexity
and functionality of the VLSI chips we build are to keep pace with advances made in VLSI
process technology, then we must make substantial improvements to the software tools 
used to design and manufacture those chips. 

1.1 Accelerating CAD Tools 

There are several ways to accelerate CAD tools: 

1. Developing more efficient software 

2. Buying faster general purpose computers 

3. Using special-purpose hardware accelerators 

4. Exploiting the hierarchy inherent in the representation of VLSI circuits

5. Exploiting the parallelism inherent in many of the existing CAD tools 

Developing more efficient software is always an attractive alternative. In industrial
design rule checking, ECAD's DRACULA2 offered an order of magnitude speed-up over
what was previously available 1 . However, significant runtime improvement through better
software is often limited by the computational complexity of the problem at hand.
Observations in industry indicate that further improvements are needed in the time taken by
design rule checkers.

1 DRACULA2 is a trademark of ECAD corporation

Buying faster general purpose computers is perhaps the lowest-risk option listed 
above. If purchasing new hardware will increase both the processing and the memory 
speed, then it will certainly increase the speed of the CAD tools that run on it. This 
strategy has the added advantage that it can be easily combined with any of the other 
strategies. However, since monetary cost rises faster than computational speed, it is not 
a cost-effective solution. This is evidenced by a comparison of Digital's VAX 8600 and
MicroVAX II computers 2 . They were both introduced in early 1985, so they represent
roughly the same level of technology. The VAX 8600 computer has approximately 5 times
the processing speed of the MicroVAX II computer, but costs about 10 times as much.
Relying on faster computers is also not likely to be a good long-term solution, because
recently the complexity of VLSI circuits has grown much faster than the cost of processing
speed has fallen. Digital's VAX 8600 computer has four times the speed and twice the cost
of the VAX 11/780 computer (1977), while chips of 1985 have twenty-five times as many
transistors as those of 8 years ago [Allen 1983].

Developing special hardware accelerators offers the greatest potential of all the
solutions listed above. Runtime improvements of several orders of magnitude are not
uncommon. In design rule checking, speedup factors of up to 140 have been predicted
using small amounts of custom hardware [Seiler 1985]. Similar improvements have been
achieved in circuit simulation using the ZYCAD hardware accelerator 3 . Unfortunately,
the cost of these devices, both in money and development time, is often prohibitive. In
the event of an algorithmic improvement that decreases the growth rate of a problem, the
hardware will lose its edge as the problem increases in size, rendering it obsolete.

2 VAX and MicroVAX are trademarks of Digital Equipment Corporation
3 ZYCAD is a trademark of ZYCAD
1.2 Parallelism in VLSI CAD

Exploiting the parallelism inherent in VLSI CAD tools is an attractive way to 
accelerate them. The nature of VLSI lends itself to a high degree of parallelism. VLSI 
chips are composed of several layers, which are often examined separately. Each layer is 
composed of different blocks which exhibit a very high degree of functional locality. Each 
block is composed of many polygons which exhibit some degree of geometric locality. 

We observe that the parallelism inherent in VLSI is manifested in CAD tools in 
several different ways. Logic simulators possess parallelism based on the locality of ac- 
tivity in a digital circuit. [Arnold 1985] exploits this property in a multiprocessing logic 
simulator based on RSIM [Terman 1983]. Design rule checking and circuit extraction can 
be accelerated by taking advantage of the geometric locality of the polygons that consti- 
tute the chip. [Levitin 1986] describes a system that uses this approach to accelerate a 
VLSI circuit extractor called IV [Tarolli, Herman 1983]. Similarly, [Bier, Pleszkun 1985] 
describes a system that divides a layout into separately checkable partitions, checks each 
partition, examines the partition boundaries to eliminate false errors and catch missed er- 
rors, and merges the resulting error reports together. In design rule checking, there is also 
parallelism inherent in the set of design rules that guide the checking program. This thesis 
describes a DRC accelerator that exploits the parallelism inherent in the design rules. 

1.3 A Software Methodology for Multiprocessing 

If an existing program can be partitioned into tasks that are each sufficiently time- 
consuming compared to the time it would take to move the task's input and output data 
between processors, then an existing local area network may be effectively used as a mul- 
tiprocessor to run that program. This is the case with DRC, and is likely to be the case 
with Digital's circuit extractor and mask-making software. If several processors share a 
common file system, such as in VAXclusters, then the input/output size constraint can be 
removed 4 . 



4 VAXcluster is a trademark of Digital Equipment Corporation 
Parallelism is a cost-effective strategy for accelerating VLSI CAD tools. No
special-purpose hardware is needed. It is possible to use a small number of general purpose
computers as a multiprocessor. Thus the utility and the expense of an n-processor system
can be shared with those who need more serial processing machines. Parallelism can
be combined with higher-speed general purpose computers and with higher-performance
software.

Many CAD tools have some parallelism, due to the nature of VLSI. So a hardware
investment made toward faster DRCs may also pay off by accelerating simulations,
mask preparations, and circuit extractions. Another example of exploitable parallelism is
compiling and linking a large software system.

This thesis describes a software system called EPIC (Exploiting Parallelism In
CAD) that controls the parallel execution of any software system that exhibits a restricted
class of parallelism. The necessary characteristics of the computational environment and 
the program to be accelerated are as follows: 

• The program must be partitioned into discrete tasks. 

• Each task must be individually callable from the operating system. 

• All communication between tasks must be done through disk files.

• Unless different computers can share the same file system, the time it takes to execute
an individual task must be greater than the time it takes to transfer the files that it
reads and writes.

1.4 Chapter Outline 

Chapter 2 describes previous work in accelerating CAD tools. This includes ef- 
forts to use parallelism and hardware acceleration to speed up design rule checking and 
simulation. The primary motivation for this thesis, Parallel ECAD DRC, is described.

Chapter 3 describes the theory and implementation of EPIC. The more interesting
features, such as task scheduling, are described in detail.

Chapter 4 describes the application of EPIC to various problems, such as design rule
checking, circuit extraction, and compiling and linking programs. Optimistic predictions
are made for the speed-up of each application. The speed-up factors are determined for
several experimental runs of each application. The experimental results are then compared
to the optimistic predictions.

Chapter 5 concludes the thesis with a summary of the work reported and suggestions 
for future research. 

Appendix A contains a user's manual for running EPIC with ECAD DRC.

Appendix B contains graphical representations for the data dependency graphs for 
several applications. 

Appendix C contains the raw data for the experimental runs, including a table of 
statistics and a graphical representation of the task assignments for each slave. 

Appendix D contains all the messages EPIC sends for control communication. They 
effectively define the architecture of the software behind EPIC. 
Chapter 2



Background 



2.1 Previous Work in Parallelism for VLSI CAD 

A substantial amount of research has recently been devoted to the area
of parallel simulation. Papers have been published on the parallel acceleration
of several classes of simulators, including relaxation based simulators such as SPLICE
[Newton, Sangiovanni-Vincentelli 1983, Deutsch, Newton 1984], and event based logic
simulators such as RSIM [Terman 1983, Arnold 1985].

Until very recently, not much had been published on parallel design rule checking.
In the past year, there has been more activity [Bier, Pleszkun 1985, Nielson 1986].
[Bier, Pleszkun 1985] seeks to exploit the geometric locality of VLSI layouts by dividing 
the layout into vertical slices, checking each slice on a separate processor, and merging the 
error reports together. This approach could suffer from a large number of missed errors 
and false errors at the borders of the slices. At some cost in redundant computation, these 
problems can be eliminated by dividing the chip into slices that overlap by at least one 
maximal design rule interaction distance (DRID). Errors reported within one DRID of the 
border of a slice are filtered out in the merge phase as potential false errors. If they are 
real errors, they will be flagged during the check of the neighboring slice. 
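
The slicing and merge steps described above can be sketched as follows. This is only an illustrative sketch in Python, not the implementation of [Bier, Pleszkun 1985]. The function names, the one-dimensional integer coordinates, and the representation of an error as an (x, message) pair are all assumptions made for the example. The sketch also assumes that each slice is widened by one DRID past each interior border, so that neighboring slices overlap by two DRIDs and every error falls in the unwidened "core" of exactly one slice.

```python
def make_slices(chip_width, n, drid):
    """Cut [0, chip_width] into n vertical slices (hypothetical helper).

    Each slice is a pair (core, checked). The core regions tile the chip;
    the checked region is the core widened by one DRID on each side,
    clipped to the chip edges.
    """
    borders = [chip_width * i // n for i in range(n + 1)]
    slices = []
    for i in range(n):
        core = (borders[i], borders[i + 1])
        checked = (max(0, core[0] - drid), min(chip_width, core[1] + drid))
        slices.append((core, checked))
    return slices


def merge_reports(slices, reports, chip_width):
    """Keep only the errors that lie in the core of the slice reporting them.

    Errors in the widened margin are dropped as potential false errors;
    a real error there lies in the core of the neighboring slice.
    """
    merged = []
    for (core, _checked), errors in zip(slices, reports):
        lo, hi = core
        for x, message in errors:
            on_right_chip_edge = (hi == chip_width and x == hi)
            if lo <= x < hi or on_right_chip_edge:
                merged.append((x, message))
    return sorted(merged)


slices = make_slices(chip_width=100, n=4, drid=3)
reports = [
    [(10, "a"), (26, "width"), (27, "false")],  # slice 0, checked 0..28
    [(26, "width"), (40, "b")],                 # slice 1, checked 22..53
    [],
    [],
]
merged = merge_reports(slices, reports, chip_width=100)
```

In this example the error at x = 26 is seen by both slice 0 and slice 1 but survives only once, and the spurious error at x = 27 from slice 0 is discarded. The redundant work is the geometry inside each widened margin, which is checked twice.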

This strategy was not tested on a real multiprocessor, but based on statistics
gathered during serial runs, a speedup of 8:1 was predicted for 14 processors. As
communications costs are small, this figure may be realistic. It is not reasonable to expect this
algorithm to offer a linear speed-up factor, since the computational overhead of
processing overlapping slices and discarding errors at the borders will grow with the number of
processors.

The data partitioning algorithm has the desirable property of having its potential
parallelism scale as a function of the complexity of the layout. If there is no communications
overhead, then we should be able to use more processors to hold the DRC execution
time constant as the circuit grows. Unfortunately, the overhead of checking overlapping
regions of the chip and removing false errors from the reports may reduce the potential
speedup significantly, and prevent the number of processors from being profitably scaled
with the layout. This thesis presents an alternative strategy that has no intrinsic
computational overhead. Unfortunately, the parallelism of our technique does not grow with the
complexity of the layout, but with the complexity of the rule set. Nevertheless, it promises
to allow more efficient use of each processor, and therefore provide better speed-up factors
for limited numbers of processors.

2.2 Previous Work in Accelerating DRC 



Empirically, the time and space consumed by a design rule check has been observed 
to be about O(n lJt ) or O(n 1 *), where n is the number of transistors. As the number
of features on a typical VLSI chip moves into the millions, DRC will become more of a 
bottleneck in the designers' loop. 

Hierarchical DRC is one possible solution to the DRC problem, and
has recently been studied extensively ([McGrath, Whitney 1980], [Whitney 1981],
[Newell, Fitzpatrick 1982], [Smith, McDonald, Chang, Jerdonek 1984]). In a normal chip,
many cells are defined in terms of other cells, and blocks of cells are repeated (such as in 
a memory). Hierarchical DRC attempts to exploit this repetition by checking only one 
instance of a given cell or cell block, regardless of how many times it occurs. This has the 
added advantage of only generating one error when a repeated cell is faulty, thus reducing 
the volume of error reports while still conveying the same information. 

In practical applications, however, the amount of repetition is limited by various
factors, such as overlapping regions and globally routed conductivity [McGrath 1985].
Often, the advantage to be gained by exploiting the repetition is lost to the overhead of
finding and re-checking the cases where a cell's boundaries are violated by other layout.
Thus, while hierarchical DRC is profitable for certain chips, it is not yet a sufficiently
general solution. When it does become profitable, it can be combined with the
multiprocessing DRC algorithm presented in [Bier, Pleszkun 1985], or the approach presented
in this thesis.

At Hewlett-Packard, hierarchical DRC has been successfully used in practice
[Hammer 1986]. Using a core of checking routines based on NCA's VDRC 1 , a methodology
was developed whereby the layout designer DRCs cells as they are initially laid out.
The CAD system maintains a central database of cells, keeping track of whether any cell 
has been modified since it was last checked. When a cell is instantiated, only externally 
visible geometry is checked in subsequent DRCs. This system is especially effective be- 
cause the cost of checking each cell is spread throughout the design process, rather than 
lumped together at the end. The disadvantage is that the designers must completely avoid 
overlapping cells with other cells and with routing. 

It has been suggested that the DRC bottleneck can be eliminated by "correctness by 
construction" [McGrath, Whitney 1980]. This involves using layout systems that enforce 
the design rules at the construction phase, making it impossible to violate a design rule. 
Such layout systems tend to use design rules that are too simplistic, resulting in poor layout 
density, and thus producing slow chips [McGrath 1985]. Specifically, the corner stitching 
structures of Magic do not provide for 45° angle geometries [Taylor, Ousterhout 1984]. 
Modern industrial design efforts require this capability. 

Advancements in the algorithms behind design rule checking have improved 
the overall performance ([Wilcox, Rombeek, Caughey 1978], [Arnold, Ousterhout 1982], 
[Chapman, Clark 1984]). For example, Chapman and Clark outline a method for im- 
proving the performance of IBM's Unified Shapes Checker by using scan lines. On chips 
with more than 50,000 transistors, they realized a CPU-time reduction of more than 50%. 
This savings is substantial, but they predict that the improvement will not be sufficient to
swiftly check chips as transistor counts move into the millions.

1 VDRC is a trademark of NCA corporation

Seiler describes a method for doing DRCs in hardware [Seiler 1985]. This method
has obvious advantages. Dedicated and custom-designed hardware can do a good job 
of exploiting "inner-loop" parallelism. However, a working prototype was not produced. 
Until the introduction of a production quality hardware DRC accelerator, it may be more 
timely to increase performance by augmenting the existing CAD software. 

2.3 Motivation: Parallel DRC 

Digital Equipment Corporation's primary motivation for supporting this project 
was to produce a system that runs parallel ECAD DRCs. The key observation that mo- 
tivated our strategy is that a design rule check does not entail the execution of a single 
algorithm, but instead involves the sequential execution of many computationally inde- 
pendent algorithms. More specifically, DRC is a sequence of rules, such as the following: 

1. POLY-DIFF SPACING > 1λ

2. POLY-POLY SPACING > 2λ

3. POLY WIDTH > 2λ

4. GATE OVERLAP > 2λ

Conceptually, there is no data dependency between these rules. Therefore, each
rule can be executed independently by a separate processor. That is not very efficient, 
because there are often intermediate computations which contribute to the checking of a 
rule, and the results of these computations are often used in the checking of more than 
one rule. We would like to do these computations only once, and share the results among 
all those processors that need them. 

These intermediate computations are explicitly listed in the ECAD rules file that is 
used to control each DRC run. The rules file is essentially a computer program written in 
a language especially tailored for DRC. The language has statements that do operations on 
the various layers of the chip, such as polysilicon and diffusion. Some statements do logical
operations such as the pixel-wise AND and OR of two layers, producing new layers. Other 
statements do spacing or width checks on a given layer at a given tolerance, producing 
error reports. A side effect of the execution of the program is that all the rules are checked.
As a final step, all the error reports are appended into a summary file, and the geometry
of the errors is depicted in an "error cell" layout that can be read into the layout editor.

Each statement in the rules file can be mapped directly onto a sequence of operating 
system commands that cause the statement to be executed. The input and output file 
names can be extracted from the text of the rules file statement. By comparing the input 
file names of one statement to the output file names of another statement, we can determine 
whether there is a data dependency between the execution of those two statements. In this 
manner, we can build a data dependency graph from the rules file, with the information 
about how to execute each statement stored at each node. 
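
The file-name comparison described above can be sketched in a few lines. This is a minimal illustration in Python, not the EPIC preprocessor itself (which was written in PL/I). The statement names, the file names, and the dictionary layout are hypothetical and chosen only for the example; the real ECAD rules file syntax is not reproduced here.

```python
def build_dependency_graph(tasks):
    """Map each task name to the set of task names it depends on.

    `tasks` maps a task name to a pair (input_files, output_files).
    Task B depends on task A when one of B's input files is one of
    A's output files.
    """
    producer = {}
    for name, (_inputs, outputs) in tasks.items():
        for file_name in outputs:
            producer[file_name] = name
    return {
        name: {producer[f] for f in inputs if f in producer}
        for name, (inputs, _outputs) in tasks.items()
    }


def roots(deps):
    """Tasks whose inputs all come from the original input data."""
    return sorted(name for name, needed in deps.items() if not needed)


# Hypothetical statements: one logical operation and two checks.
tasks = {
    "and_poly_diff": (["poly", "diff"], ["gate"]),
    "poly_width": (["poly"], ["poly_width.err"]),
    "gate_overlap": (["gate", "poly"], ["gate_overlap.err"]),
}
deps = build_dependency_graph(tasks)
```

Here "gate_overlap" depends on "and_poly_diff" because it reads the "gate" file that the latter writes, while the other two statements read only original layers and are therefore roots.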

The data dependency graph has a set of roots, or nodes whose input files are part
of the input data to the whole task, rather than outputs of another node. The number of 
roots is generally equal to the number of different VLSI layers for the particular process 
technology. The computation must begin with the roots. How the computation proceeds 
depends on the scheduling strategy, and greatly influences the performance of the whole 
parallel execution. 

2.4 Scheduling strategies 

The following approach is taken by ECAD in their marketed version of Parallel
DRACULA 2 [Nielson 1986]. It requires a multiprocessor with a shared filesystem, such as a
VAXcluster; it won't run on a local area network. This implies that it won't suffer file
transfer overhead. It also depends on the scheduling facilities built into the multiprocessor.
When submitting a non-interactive (batch mode) job to a VAXcluster, the VAX/VMS
operating system 3 determines which processor is most responsive, and assigns the job
accordingly.

The first step is to divide the data dependency graph into sections, as shown in
Figure 2.1. Each section contains all the nodes in the graph that have a given distance
from the roots of the graph, where distance is simply the number of nodes one must pass
through to arrive at the destination. For example, the roots comprise a section whose
distance is zero.

Figure 2.1: Sectioned Data Dependency Graph

2 Parallel DRACULA is a trademark of ECAD corporation
3 VMS is a trademark of Digital Equipment Corporation

The approach proceeds by executing each section one at a time. Every node in 
the current section must be completed before any node in the next section can start. 
This guarantees that the data dependencies will not be violated. It is also very easy to 
implement. The parallel execution is controlled by a command file. 
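
The sectioning step can be sketched as follows. This is an illustrative Python sketch, not ECAD's command file. The graph is assumed to be acyclic and is represented as a hypothetical dictionary mapping each node to the set of nodes it depends on. Where a node can be reached from the roots by paths of different lengths, the sketch uses the longest one, which is an assumption on my part but is what the guarantee about data dependencies requires.

```python
def sections(deps):
    """Group nodes by their distance from the roots.

    The distance of a root is 0; the distance of any other node is one
    more than the largest distance among the nodes it depends on.
    """
    if not deps:
        return []
    depth = {}

    def distance(node):
        if node not in depth:
            needed = deps[node]
            depth[node] = 0 if not needed else 1 + max(distance(p) for p in needed)
        return depth[node]

    for node in deps:
        distance(node)
    result = [[] for _ in range(max(depth.values()) + 1)]
    for node in sorted(deps):
        result[depth[node]].append(node)
    return result


# Hypothetical graph: "a" and "b" are roots.
example = {
    "a": set(), "b": set(),
    "c": {"a"}, "d": {"a", "b"}, "f": {"b"},
    "e": {"c", "d"},
}
example_sections = sections(example)
```

Executing the sections in order, with a barrier after each one, means that "e" cannot start until "f" has finished, even though "e" does not depend on "f".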

There are at least two substantial drawbacks to this method. At the end of the
execution of each section, the faster processors will remain idle while the slower processors
finish up their tasks. At best, this severely limits the number of processors that can be
profitably used. At worst, it implies that a processor that becomes severely overloaded or
hung (for example, due to another user) after a task has been assigned to it is guaranteed
to block the execution of the DRC. Another drawback to ECAD's method is that the
requirement that it be run on a VAXcluster is inconvenient; Digital would like to run
parallel DRCs on VAX computers that are not VAXclustered together.

By more cleverly using the data dependency graph, we can increase the potential
parallelism substantially, keeping each processor busy nearly all the time, thereby enjoying
increased performance compared with ECAD's method. To do this, we need to layer a
sophisticated parallel scheduling and execution system around ECAD DRC.

Unfortunately, ECAD DRC represents a true "black box" abstraction: the source
code is not for sale. Furthermore, its user interface was not designed to be used as an 
interface to another program. Though the command interface to any given version of the 
software may be sufficiently documented, it is not guaranteed to remain stable over time. 

A system that is layered around such an inaccessible piece of software must be 
written to be resilient to change in the interface to that software. Also, it must not depend 
on specific restrictions that may only apply to the current version of ECAD. One such 
restriction is that each line in the ECAD rules file corresponds to a task with no more
than two input and output files. It is conceivable that this restriction could disappear at 
the whim of an ECAD engineer. 

The way to achieve this resiliency is to try to choose a model for the computational 
structure of ECAD's DRC that is general enough to be adaptable to any conceivable change 
that ECAD might make. The following chapter describes how this is done. 
Chapter 3

EPIC: A general method of exploiting 
parallelism 



This chapter describes the implementation of a software system called EPIC
(Exploiting Parallelism In CAD). EPIC provides a mechanism for controlling the parallel
execution of existing software that exhibits a specific class of intrinsic parallelism. EPIC
was written in PL/I for the VAX/VMS operating system, and runs on any number of VAX
computers connected by DECnet or in a VAXcluster 1 . No special hardware configurations
are required. Between the EPIC kernel and the preprocessors provided for running ECAD
DRCs and Makefiles, 8751 total lines containing 6548 PL/I source statements were written.

3.1 Dividing the job 

The system described here provides a mechanism for running Parallel DRC by solv- 
ing the more general problem of how to control the parallel execution of any program that 
can be externally divided into a finite set of tasks. We define a task as a unit of computation
that can be executed using a sequence of standard operating system commands (such as
DCL commands, for the VAX/VMS operating system). Each task has a known, finite set
of inputs and outputs, each of which is a disk file. These tasks are explicitly specified in
the manner of Figure 3.1.

1 DECnet is a trademark of Digital Equipment Corporation

task "split"-
    /input=(chip.data)-
    /output=(left.data,right.data)-
    /dcl=("$splitter chip left right")

task "left"-
    /input=(left.data)-
    /output=(left.errors)-
    /dcl=("$drc left")

task "right"-
    /input=(right.data)-
    /output=(right.errors)-
    /dcl=("$drc right")

task "merge"-
    /input=(left.errors,right.errors)-
    /output=(chip.errors)-
    /dcl=("$merge left right chip")




Figure 3.1: Sample Task Description List and Data Dependency Graph 

The strategy we will use for Parallel DRC involves distributing the design rules to
the various processors. Each processor applies its subset of the rules to the whole chip. But
EPIC is not restricted to this form of parallelism, which is called instruction partitioning.
As hinted at in Figure 3.1, EPIC is well suited to data partitioning. The multiprocessing
DRC scheme proposed by [Bier, Pleszkun 1985] could easily have been implemented with
EPIC.

A simple way to determine whether or not we can expect EPIC to be able to
enhance the performance of a given program is by comparing the sizes of the input and
output files of each of its tasks with the time it takes to execute those tasks. If the execution
time is far greater than the amount of time it takes to transfer the input and output
files between the various processors, then the potential exists for substantial throughput
improvements using EPIC. Of course, if all of the processors share a single file system,
then data communication becomes less of a bottleneck, and the restriction can be relaxed.
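
As a rough illustration of this comparison, the following Python sketch tests a single task. It is not part of EPIC. The function name, the parameters, and in particular the `margin` factor standing in for "far greater" are assumptions made for the example, and the figures in the usage lines are invented rather than measured.

```python
def worth_distributing(exec_seconds, bytes_moved, bytes_per_second,
                       shared_file_system=False, margin=10.0):
    """Return True if a task looks like a good candidate for remote execution.

    exec_seconds       time to run the task on one processor
    bytes_moved        total size of the task's input and output files
    bytes_per_second   assumed effective network transfer rate
    margin             hypothetical factor by which execution time must
                       exceed transfer time
    """
    if shared_file_system:
        # No files need to be copied, so the size restriction does not apply.
        return True
    transfer_seconds = bytes_moved / bytes_per_second
    return exec_seconds >= margin * transfer_seconds


long_task = worth_distributing(600, 2_000_000, 100_000)   # 600 s vs 20 s transfer
short_task = worth_distributing(30, 2_000_000, 100_000)   # 30 s vs 20 s transfer
clustered = worth_distributing(30, 2_000_000, 100_000, shared_file_system=True)
```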

The extent of the parallelism, and hence the potential for throughput enhancement, 
is further limited by the data dependencies within the task list. By comparing the inputs 
and outputs of each task, we can generate a data dependency graph, as shown in Figure
3.1. 

In Figure 3.1, the potential parallelism is limited to a maximum of two processors. 
If we assume that each task takes one "tick", then by using two processors we can do 
the job in 3 ticks, whereas we would need 4 with a single processor. Due to the data 
dependencies, a third processor couldn't be used at all. So we say the parallelism has a 
maximum extent of 2. 
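
The tick counts quoted for Figure 3.1 can be checked with a small simulation. This is an illustrative Python sketch under the same assumptions as the text (unit execution time, no communication overhead). The function name is hypothetical, and the simple greedy rule used here (run as many ready tasks as there are processors on each tick) is not EPIC's scheduling algorithm.

```python
def ticks_needed(deps, processors):
    """Count ticks to finish all tasks, each taking exactly one tick.

    `deps` maps each task to the set of tasks that must finish first.
    """
    done = set()
    ticks = 0
    while len(done) < len(deps):
        ready = sorted(t for t in deps if t not in done and deps[t] <= done)
        if not ready:
            raise ValueError("dependency graph contains a cycle")
        done.update(ready[:processors])
        ticks += 1
    return ticks


# The four tasks of Figure 3.1.
figure_3_1 = {
    "split": set(),
    "left": {"split"},
    "right": {"split"},
    "merge": {"left", "right"},
}
one = ticks_needed(figure_3_1, 1)
two = ticks_needed(figure_3_1, 2)
three = ticks_needed(figure_3_1, 3)
```

With one processor the four tasks run back to back; with two, "left" and "right" share a tick; a third processor never finds a ready task to run.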
Figure 3.2: A More Interesting Data Dependency Graph and its Execution

The most obvious way to try to determine the extent of parallelism is to find the 
width of the widest row in the graph. This worked in Figure 3.1, and clearly having that 
many processors would yield the fastest possible execution time. However, by assuming 
that each task executes in one tick, we can do just as well using fewer processors. Consider 
the data dependency graph in Figure 3.2. The maximum extent of parallelism is now 3, 
since we can keep 3 processors busy at time = 2. But the minimum extent of its parallelism 
is 2, because "4" can be executed by the second processor during the third tick, while the 
first processor is executing "5". EPIC tries to optimize task scheduling in this manner so
it can get the most performance out of the available processing power. 
3.2 Multiprocessing on a Local Area Network

The computational model I have selected for parallel processing is not unlike the
dataflow model. Of course, the size of each atomic computation is somewhat smaller in
dataflow, so the capacity for incurring overhead from controlling the computation is also
smaller. Hence, I use a significantly different approach to controlling the computation in
EPIC.




Figure 3.3: Star Network Topology

Ethernet 2 technology is used as the physical layer beneath the DECnet protocol
in DEC's local area networks 3 . Ethernet is essentially a coaxial cable that connects each
node on the network. A processor sends a message by broadcasting it over the cable.
Each processor receives all the messages and scans them for the ones that are addressed
to it. Conceptually, an Ethernet can provide the basis for a variety of software network
topologies. The topology EPIC uses is a star network, as shown in Figure 3.3. The
processor at the center of the star, called the master, is responsible for controlling the
whole execution. One of the processors on the points of the star is used to provide a user
interface for the master. An interactive program called MONITOR is run on this computer
to allow a human to control the execution. The remaining processors at the points of the
star, called slaves, are responsible for executing whatever tasks the master assigns, and for
transferring the appropriate input and output files.

2 Ethernet is a trademark of Xerox Corporation

3 DEC is a trademark of Digital Equipment Corporation

There were several specific engineering factors considered in the decision to use a 
star network topology. The programs we intend to run in parallel tend to have irregular 
computational structures. Their data dependency graphs take on arbitrary shapes, forcing 
us to spend considerable effort trying to keep each processor busy. This is further compli- 
cated by the computational environment in which we run. Each processor is a time-sharing 
computer, and while we expect that EPIC would only be run when it wouldn't be competing
for cycles, we can't let a loaded processor slow down the rest of the computation. Thus
a fragile task scheduling strategy would involve allocating each task to a specific processor
before the computation begins. A more robust task scheduling strategy is to dynamically 
assign computable tasks to available processors, so a relatively slow processor will execute 
proportionally fewer tasks. Fortunately, since each task takes so much time, we can afford 
to incur some computational overhead figuring out the best strategy for assigning tasks 
to processors. A good way to do that is to have one processor running a master program 
that has total control of the computation. 
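
The effect of dynamic assignment can be illustrated with a small simulation. This is only a sketch under stated assumptions: every task represents the same amount of work, there is no communication cost, and the function name and its parameters are hypothetical rather than part of EPIC.

```python
import heapq

def dynamic_assignment(num_tasks, seconds_per_task):
    """Hand each task to whichever processor becomes free first.

    seconds_per_task[p] is the (assumed) time processor p needs for one task.
    Returns (tasks executed by each processor, overall finish time).
    """
    # Heap of (time at which the processor becomes free, processor index).
    free_at = [(0.0, p) for p in range(len(seconds_per_task))]
    heapq.heapify(free_at)
    counts = [0] * len(seconds_per_task)
    finish = 0.0
    for _ in range(num_tasks):
        now, p = heapq.heappop(free_at)
        counts[p] += 1
        done = now + seconds_per_task[p]
        finish = max(finish, done)
        heapq.heappush(free_at, (done, p))
    return counts, finish

# One unloaded processor and one that runs at half speed.
counts, finish = dynamic_assignment(30, [1.0, 2.0])
```

With these numbers the slow processor ends up with 10 of the 30 tasks and the run finishes at time 20, whereas a static split of 15 tasks each would not finish until time 30.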

As it turns out, the master does not take very much CPU time once some initial 
preprocessing has been done. Most of the time, it's just waiting for a slave to indicate that 
it is finished with its task. The short burst of CPU time it needs to figure out which task 
gets allocated to the free slave is small compared to the time it takes the slave to finish the 
task. Experimentally, I have determined that the master can efficiently share a processor 
with a slave. 

It is enlightening to look at an example which is not conducive to a star network 
topology. In regular parallel structures, it is easy to predetermine the best way to allocate 
processors to tasks. Systolic arrays are one way of executing such computations. Central 
control of each processor in a systolic array is undesirable, since there is typically a large 
amount of communication between neighboring processors, but very little other
communication. It is better to have each processor know precisely how and when to talk to its
neighbors than to have one processor take responsibility for relaying all communication
from the sender to the receiver. Due to the extreme volume of information passing through
it, that processor would then be a severe bottleneck in the computation.

Another class of applications that are not well suited to the EPIC model of
computation are those where it is not clear at the start of the program exactly what computation
will occur. The task breakdown is done at run time, rather than "compile" time. If this
is the case, EPIC will not be able to efficiently schedule the tasks.

A good example of this is Parallel RSIM [Arnold 1985]. It uses a master-slave star
network configuration as its multiprocessor, but there is no finite set of tasks from which
to generate a data dependency graph, since RSIM is an event-based simulator. A change in
the value of a node in the circuit causes a simulation of the surrounding devices. If this
simulation causes other node values to be changed, then the devices connected to those
nodes are simulated as well. This propagation continues until the network settles. There is
no way to predetermine exactly what computation will occur when a given node changes.
Instead, before any simulation occurs, Parallel RSIM exploits functional locality in the
circuit by partitioning it and sending one section to each processor. The various sections
are simulated independently until a value on a shared node changes. The processor that
changed the node then sends a message to the other processors that share the node, indicating
the new value and the simulated time when the change occurred. EPIC is not equipped
to deal with this sort of computation. It needs to know about each task in the problem
before it can begin.

3.3 Software Architecture 

EPIC is composed of three separate programs: MONITOR, MASTER, and SLAVE. Each
is run in a separate process. These processes can be on different computers. Normally,
one would run the MONITOR, MASTER, and one SLAVE all on one processor, since MONITOR
and MASTER take almost no CPU time during the computation.

The three programs communicate by passing messages. Using VAX/VMS mailboxes
and the DECnet interprocess communication protocol, a message passing subsystem was
developed. It provides a uniform procedural interface to allow programs to easily handle
a variety of asynchronous events, such as subprocesses, timers, multi-client interprocess
communication, and terminal I/O.

The MONITOR is the only program with which the user interacts. It allows the user 
to initiate and control the parallel execution, and provides a periodically updated display 
of the status of each SLAVE's process. For more information about the MONITOR, see the
EPIC/DRACULA User's Manual in the Appendix.

The most interesting program is the MASTER. It is initiated by a user instruction to 
the monitor. The monitor creates a remote process on the master's processor, and opens 
up a communication channel to it using the message passing system. From that point on, 
the monitor is used essentially as a front end for the master.

open execution control file
task_list := empty_list()
while not (end-of-file) do
    read task description
    append task description to task_list
end while

for each element "t1" in task_list do
    for each element "t2" after t1 in task_list do
        if any of t1's outputs match any of t2's inputs then do
            t1 is a predecessor of t2
            t2 is a successor of t1
        end if
    end for
end for



Figure 3.4: Algorithm for Generating Data Dependency Graph 

The first thing the master does is read the execution control file, which contains 
all of the task descriptions. This is all the master needs to know about the particular 
application being run (e.g. DRC or Makefile). Recall that a task description indicates 
all the input and output files, as well as the sequence of operating system commands that
run that task. Completing each of the tasks in the execution control file is equivalent to
running the application. The master is charged with distributing those tasks among all
the available processors so as to minimize the total execution time. The strategy it uses
requires the generation of a data dependency graph from the execution control file. The
algorithm used is presented in Figure 3.4.
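
The graph construction of Figure 3.4 can be sketched in Python as follows. This is an illustration, not EPIC's code: the `Task` class, the function name, and the file names are hypothetical. Like the figure, it only compares a task with the tasks listed after it, so it assumes the execution control file lists producers before consumers.

```python
from dataclasses import dataclass, field

# eq=False keeps identity-based comparison; tasks refer to each other,
# so a field-by-field comparison would recurse without end.
@dataclass(eq=False)
class Task:
    name: str
    inputs: frozenset
    outputs: frozenset
    predecessors: list = field(default_factory=list)
    successors: list = field(default_factory=list)

def build_dependency_graph(tasks):
    """Link each task to every later task that reads one of its outputs.

    Returns the roots, i.e. the tasks that have no predecessors.
    """
    for i, t1 in enumerate(tasks):
        for t2 in tasks[i + 1:]:
            if t1.outputs & t2.inputs:      # a shared file name
                t1.successors.append(t2)
                t2.predecessors.append(t1)
    return [t for t in tasks if not t.predecessors]

# Hypothetical task descriptions, in execution control file order.
tasks = [
    Task("extract", frozenset({"chip.cif"}), frozenset({"metal.dat", "poly.dat"})),
    Task("width", frozenset({"metal.dat"}), frozenset({"width.err"})),
    Task("spacing", frozenset({"metal.dat", "poly.dat"}), frozenset({"spacing.err"})),
]
roots = build_dependency_graph(tasks)
```

The nested loop makes n(n-1)/2 comparisons for n tasks, which is the source of the quadratic cost of this step.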

The master maintains the database of slaves. A slave is created in response to a
request that the user gives to the monitor. The monitor relays the request to the master,
and just as the monitor created the master, the master uses the message passing package
to create a remote process on the slave's processor and open a communication channel
with it. The user can request a slave at any time after the master has been created. Each
slave has the capacity to execute one task at a time. Hence each slave can be in one of
two states: "busy" or "idle". An idle slaves list and a busy slaves list are maintained
throughout the computation.

The computation begins with the roots of the graph. A task is a root if it has no 
predecessors. So initially, the roots are placed on a ready queue. A task on the ready queue
is said to be computable. When all of a task's predecessors are completed, it is placed on
the ready queue.

3.4 Task Scheduling 



do while there are tasks left to execute
    do while (the ready queue and the free slave list aren't empty)
        assign a slave to a task
    end while
    wait for a slave to finish or a "create slave" message
end while



Figure 3.5: A Skeleton for a Task Scheduling Algorithm

With a list of free slaves and a ready queue, the master can begin the computation. 
The basic structure of the algorithm used to control the execution is presented in Figure
3.5.

Each statement in the algorithm corresponds to a substantial amount of
programming. For example, "wait for a slave to finish" implies (among other things)
checking the finished slave's task's successors to see if they are now computable. One statement
which implies a good deal more is "assign a slave to a task". If the number of free
slaves is greater than or equal to the number of tasks on the ready queue, then we can
assign any of the computable tasks to a slave, since each of the tasks will be assigned
before the loop falls through to the "wait..." statement. Unfortunately, we are not usually
provided the luxury of being guaranteed more slaves than tasks on the ready queue. The 
choice of which task to assign must be made carefully, because it can have fairly profound 
effects on the speed-up factor of the parallel execution. 

A bad algorithm for choosing tasks can result in data dependency bottlenecks.
Choosing tasks optimally is NP-complete [Mehrotra, Talukdar 1982]. We
present here a heuristic for choosing tasks that has been observed to perform optimally
under most conditions. It requires a preprocessing step that has time complexity O(n^2),
where n is the number of tasks.

The first step toward discovering this heuristic is to identify the goals of the whole 
parallel execution system, and how the task scheduling algorithm must try to help achieve 
these goals at minimal cost. The main objective is to minimize the real time (as opposed
to the CPU time) needed to execute a set of tasks, given a finite number of processors. To
do this, the task scheduling strategy must keep all processors busy as much of the time as
possible. Each processor will always be busy as long as there are computable tasks. So
a good subgoal is to keep the ready queue as full as possible. Executing a task that has 
no successors (called a leaf) will clearly make no progress toward replenishing the ready 
queue. Executing a task that has many successors will clearly make some progress towards 
that goal, but it's still not clear how one should measure the immediacy of the need to 
execute a given task. What we do know is that we are interested in the characteristics of 
the subgraph rooted at that task's node in the data dependency graph. 

To help focus our attention on the right characteristics of a task's subgraph, we
observe that the limiting factor of a computation is the longest path through the data
dependency graph (the critical path). No matter how many processors are available, the
overall execution time will never be less than the sum of the execution times of all the
tasks along the critical path. This sum is called the height of the graph. As the computation
progresses, we seek to chip away at this critical path in support of our meta-goal, which is
to minimize the total execution time. So the conclusion of this intuitive argument is that we
should give top priority to tasks which lie on the critical path. The appropriate quantitative
measure is the height of the task's subgraph. Using the algorithm presented in Figure 3.6,
we can compute the height of each of the n tasks in O(n) time.

for each task "T1" in task_roots do
    compute height_of_task(T1)
end do

height_of_task(T1):
    if T1.height is set then return(T1.height)
    sub_height := 0
    for all successors "T2" of task T1 do
        sub_height := max(sub_height, height_of_task(T2))
    end for
    T1.height := estimated_execution_time(T1) + sub_height
    return(T1.height)
end height_of_task

Figure 3.6: Algorithm for computing height of all tasks
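
The height computation can be sketched in Python as a memoized depth-first traversal. The dictionary representation (`successors` maps a task name to the names of its successors, `exec_time` maps a task name to its estimated execution time) and the function name are assumptions made for this illustration.

```python
def compute_heights(successors, exec_time):
    """Return {task: height}, where the height of a task is its own
    execution time plus the largest height among its successors."""
    heights = {}

    def height(task):
        if task not in heights:             # each task is computed only once
            below = max((height(s) for s in successors.get(task, ())), default=0)
            heights[task] = exec_time[task] + below
        return heights[task]

    for task in exec_time:
        height(task)
    return heights

# Hypothetical four-task graph with unit execution times.
successors = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
exec_time = {"a": 1, "b": 1, "c": 1, "d": 1}
heights = compute_heights(successors, exec_time)
```

Because a height is stored the first time it is computed, every task is expanded once, no matter how many predecessors share it.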

Using the height as a priority scheme for each task does not provide very much
resolution. In the data dependency graph generated from a sample design rule checker's
execution control file, the estimated execution time of each task is 1, and the heights of
all the tasks are integer values between 1 and 8. But there is more information in a data
dependency graph that is intuitively related to how critical each particular task is. In
particular, the total number of tasks that directly or indirectly depend on a given task
is relevant. In a sense, it is the measure of the total fanout of a particular task. It is
equal to the size of the task's subgraph. The algorithm in Figure 3.7 computes the size
of n tasks in O(n^2) time. In practice, this has been an acceptable penalty to pay for the
more accurate scheduling capability. In the case of design rule checking, the penalty is
insignificant compared to the time spent doing the DRC.

for each task "T1" in task_list do
    clear_examined(T1)
    T1.size := find_size(T1)
end for

clear_examined(T1):
    T1.examined := FALSE
    for each successor "T2" of T1 do
        clear_examined(T2)
    end for
end clear_examined

find_size(T1):
    if T1.examined = TRUE then return(0)
    T1.examined := TRUE
    size := 0
    for each successor "T2" of T1 do
        size := size + find_size(T2)
    end for
    return(size + estimated_execution_time(T1))
end find_size

Figure 3.7: Algorithm for computing the size of all tasks
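
The size computation can be sketched in the same style. The sketch below uses a per-task `seen` set in place of the `examined` flags, so that a task reachable along several paths is counted only once. The dictionary representation and the function name are assumptions made for this illustration.

```python
def compute_sizes(successors, exec_time):
    """Return {task: size}, where the size of a task is the total
    execution time of every task in the subgraph rooted at it."""
    sizes = {}
    for task in exec_time:
        seen = set()                # plays the role of the "examined" flags
        stack = [task]
        total = 0
        while stack:
            node = stack.pop()
            if node in seen:
                continue            # shared descendants are counted once
            seen.add(node)
            total += exec_time[node]
            stack.extend(successors.get(node, ()))
        sizes[task] = total
    return sizes

# Hypothetical diamond-shaped graph: "d" depends on both "b" and "c".
successors = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
exec_time = {"a": 1, "b": 1, "c": 1, "d": 1}
sizes = compute_sizes(successors, exec_time)
```

One full traversal is made per task, which is why this step costs more than the height computation.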

Empirically, we verify our suspicion that the height of a task's subgraph is a better 
measure of its priority than the size of the subgraph. The way to compare the performance 
of the heuristics is by simulating a parallel execution under the assumptions that each 
task takes unit time and that there are no communication costs. We then depend on 
real experiments to back up the results of the simulation. Figure 3.8 shows the parallel 
execution simulations of a data dependency graph using four processors. While this is 
only one example, by running the two simulations in your mind, hopefully you will gain 
intuition that lends support to our empirical observations. 

Figure 3.8: A data dependency graph and its execution using height and size

Now we have two numbers associated with each task: a height and a size. We use
these as keys to keep the ready queue sorted: first by height and then by size. With the
most crucial tasks at the front of the queue, the task scheduling strategy is complete. The
O(n^2) operation to find the sizes is run only once before the start of the run. Typically,
for a design rule checker's data dependency graph, there are fewer than 200 nodes, and the
total time spent on the preprocessing step at the beginning is less than 30 seconds. Once the
height and size of each node are computed, they are used to dynamically guide the scheduler
in assigning the most urgent task to a slave whenever that slave finishes its previous task.

The strategy performs optimally in most cases. After creating data dependency
graphs of various shapes and sizes and simulating each one with a varying number of
processors, only one example was found in which the height/size heuristic did not perform
optimally: it took seven time units instead of six. This is illustrated in Figure 3.9.
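
A simulation of this kind can be sketched in a few lines of Python. It makes the same assumptions as the simulations described above (unit execution time, no communication cost). The function names and the example graph are hypothetical, and the graph is not the one shown in the figures.

```python
def heights_and_sizes(successors):
    """Unit-time height and subgraph size of every task."""
    height, reach = {}, {}

    def visit(t):
        if t in height:
            return
        for s in successors[t]:
            visit(s)
        height[t] = 1 + max((height[s] for s in successors[t]), default=0)
        reach[t] = {t}.union(*(reach[s] for s in successors[t]))

    for t in successors:
        visit(t)
    return height, {t: len(r) for t, r in reach.items()}

def simulate(successors, num_processors, priority):
    """Return the number of unit time steps needed to run every task."""
    waiting_on = {t: 0 for t in successors}
    for succs in successors.values():
        for s in succs:
            waiting_on[s] += 1
    ready = [t for t, n in waiting_on.items() if n == 0]
    steps = 0
    while ready:
        ready.sort(key=priority, reverse=True)      # most urgent first
        running, ready = ready[:num_processors], ready[num_processors:]
        steps += 1
        for t in running:
            for s in successors[t]:
                waiting_on[s] -= 1
                if waiting_on[s] == 0:
                    ready.append(s)
    return steps

# A chain of four tasks plus four independent leaves, on two processors.
graph = {"a": ["b"], "b": ["c"], "c": ["d"], "d": [],
         "e": [], "f": [], "g": [], "h": []}
height, size = heights_and_sizes(graph)
by_height_then_size = simulate(graph, 2, lambda t: (height[t], size[t]))
leaves_first = simulate(graph, 2, lambda t: -height[t])
```

On this graph the height/size ordering finishes in four steps, because the chain is started immediately and the leaves fill the second processor. Running the leaves first takes six steps, since the chain must then run alone.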

Figure 3.9: A task scheduling trial simulation where height/size heuristic is suboptimal

3.5 Communications

There are two major obstacles blocking us in our pursuit of a linear speed-up factor.
The first is the challenge of keeping each processor busy as much as possible. For the class
of applications that we wish to accelerate, the task scheduling strategy introduced in the
previous section does an adequate job. While testing EPIC's application to an industrial
design rule checker, the task scheduling behaved well. This is discussed in more detail in
the following chapter.

The next challenge is that of minimizing the communications overhead. Since EPIC
was designed to run on a loosely coupled multiprocessor, communication is fairly
expensive. In EPIC, there are two flavors of interprocessor communication: control and data.
The mechanisms used for these two forms of communication are different.

3.5.1 Control Communication 

Control communication is accomplished using the message passing package
developed for EPIC. It is based on the VMS/DECnet task-to-task communications protocol
[VMS 1985]. From a programmer's point of view, one simply opens a channel using a file
specification of the form:

node"username password"::"task=commandfile"

This causes a message to be sent on the Ethernet to node, requesting that a process
be created for username, and that that process run commandfile. The commandfile on
node should then open a channel (or invoke a program that opens a channel) using a file
specification of the form SYS$NET:. By writing to and reading from these channels, the
processes can send messages to each other.

The above mechanism provides the necessary channels of interprocessor
communication in the case where one process wants to create a new process on another processor
and then talk to it. If two existing processes want to establish a channel of communication,
then another strategy is used. When an EPIC program (MONITOR, MASTER, or SLAVE) is
run, it creates a VAX/VMS mailbox [VMS 1985]. A mailbox contains a global buffer into
which any process that knows how to find the mailbox can write a message. When the
program creates the mailbox, it assigns a logical name to the mailbox so that other
processes can find it. By convention, MONITOR uses the logical name EPIC$MONITOR, MASTER
uses EPIC$MASTER, and SLAVE uses EPIC$slave-name. Therefore, within a single logical
name space, there can only be one monitor and one master, and each slave name must be
unique. Thus when one program wants to contact another, it opens up a channel to the
appropriate mailbox (for example, the monitor opens up a channel to node::EPIC$MASTER:)
and initiates a conversation. By reading from and writing to that channel, the two existing
programs can communicate.

3.5.2 Control Communication Requirements

The MASTER program communicates with any number of slaves, in addition to the
monitor. The "wait for a slave to finish or a "create slave" message" line in
Figure 3.5 requires the use of an I/O subroutine that is not provided by VAX/VMS or
the PL/I run time library. At some level in the code, there must be some statement that
reads a record from any of several I/O channels, returning the message and the channel
number of the first channel to send a record. In order to provide this functionality, an
asynchronous read request is left pending on each channel using the VAX/VMS system
service SYS$QIO. When the channel responds, a subroutine specified as a parameter to
SYS$QIO is called at the interrupt level. This subroutine is called an asynchronous system
trap (AST). It is the AST's responsibility to append the message that was received onto a
queue of messages, set a global event flag that indicates that a message was received, and
requeue the SYS$QIO.

When designing a large, complex system such as EPIC, the existence of ASTs
poses a tough software engineering problem. Since ASTs execute at a higher priority level
than mainline code, we cannot generally assume atomicity in a sequence of operations that
updates a data structure. For example, if one is in the process of deleting an element from
a doubly linked list, and an AST is triggered that modifies that list, the list could be left
in an inconsistent state. In short, ASTs are a power tool, and when power tools are used
carelessly, they can kill 4 (or at least cause endless hours of debugging).

There are two strategies for ensuring harmony in data structures that are shared 
between mainline code and AST routines. The first is to disable AST interrupts with a 
system call wherever synchronous code accesses a data structure that it shares with AST 
routines. The disadvantage of this approach is that while interrupts are disabled, the user 
process can't respond to messages it receives from other processes. If the sending process 
uses asynchronous WRITEs, then it could queue up an arbitrary number of messages while
the receiving process remains in "disabled-interrupts" mode. Depending on the buffer size
parameters selected by the system manager of the computer facilities, the buffer could
overflow. If the sending process uses synchronous WRITEs, meaning the WRITE statement
doesn't return until the reader's AST has been triggered, then the sender will be delayed
until the reader's interrupts have been re-enabled. In this case, if the reader has interrupts
disabled while waiting for the "message-received" event flag to be set, a deadlock could 
occur. 

The other strategy is to carefully code the routines that access shared data struc- 
tures so that they are never in an inconsistent state. It is possible to do this for singly 
linked lists, but not doubly linked lists. This is a fairly serious restriction, since it is diffi- 
cult to delete an arbitrary element from the middle of a singly linked list. One way around 
this is to share only a singly linked list between mainline and AST-level code. The only 
operation ASTs get to perform is appending to the tail of the list. All that the mainline 
code does with that list is remove messages from the head of the list and place them in a 
more versatile data structure that is safe from ASTs. 
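
The hand-off pattern described above can be sketched as follows. This is only an analogy: Python threads stand in for AST routines, a `collections.deque` (whose `append` and `popleft` operations are thread-safe) stands in for the shared singly linked list, and all names are hypothetical.

```python
import threading
from collections import deque

shared = deque()        # the only structure the "AST" is allowed to touch
by_channel = {}         # mainline-only structure, safe from the "AST"

def ast_deliver(channel, message):
    """Stand-in for an AST routine: its only operation is append-to-tail."""
    shared.append((channel, message))

def drain():
    """Mainline code: remove messages from the head of the shared list
    and file them in a more versatile private structure."""
    while True:
        try:
            channel, message = shared.popleft()
        except IndexError:          # the shared list is empty
            return
        by_channel.setdefault(channel, []).append(message)

# Four simulated slaves report completion asynchronously.
threads = [threading.Thread(target=ast_deliver, args=("slave%d" % i, "done"))
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
drain()
```

Because the producer only appends and the consumer only removes from the head, neither side ever observes the list in a half-updated state.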

4 "Power tools can kill" is a maxim credited to Brian Reid of Stanford University

The message passing facility uses a compromise between these two approaches.
Since EPIC requires both the capability of reading a message from the first channel and
the capability of reading a message from a specific channel, the shared list has to support
the ability to scan through the list and remove the appropriate message. This could
have been implemented using the latter strategy, but the following strategy was more
convenient to code, and in practice did not suffer noticeable performance penalties. It
shares only one structure between AST-level routines and mainline routines: a doubly
linked list structure. Interrupts are only disabled for the time it takes to find and remove
the appropriate message. In practice, finding the appropriate message in the list was
not expensive, since the list generally had fewer than 10 messages. Removing the message
amounts to moving a few pointers. The key to making the "disable-interrupt" strategy
work is to avoid doing any I/O calls while interrupts are disabled.

The primary motivation for writing the message passing package was to eliminate
all asynchronous code from the rest of EPIC. In addition, the message passing package
provides a uniform synchronous procedural interface for handling asynchronous
communication between a process p1 and the following entities:

• Independent processes that p1 created on another node

• The process that created p1 from another node

• An independent, already existing process on any node

• Subprocesses created by p1

• The terminal attached to p1

• Timers created by p1

The single most significant function it provides is that of reading from the first of
any of the entities that sends a message.
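
The two read operations (from the first channel with a message, or from one specific channel) can be sketched in Python. This is an analogy under stated assumptions: a lock held only while the pending list is scanned plays the role of briefly disabling interrupts, and the class and method names are hypothetical rather than the package's actual interface.

```python
import threading

class MessageHub:
    """Pending messages from many channels, readable by arrival order
    or by channel."""

    def __init__(self):
        self._pending = []                  # (channel, message), oldest first
        self._cond = threading.Condition()

    def deliver(self, channel, message):
        """Called by the asynchronous side when a message arrives."""
        with self._cond:
            self._pending.append((channel, message))
            self._cond.notify_all()

    def read(self, channel=None, timeout=None):
        """Return the oldest (channel, message) pair, optionally restricted
        to one channel. Blocks until a match arrives or the timeout expires."""
        def find():
            for i, (ch, _) in enumerate(self._pending):
                if channel is None or ch == channel:
                    return i
            return None

        with self._cond:    # held only for the scan and removal, never for I/O
            if not self._cond.wait_for(lambda: find() is not None, timeout):
                raise TimeoutError("no matching message")
            return self._pending.pop(find())

hub = MessageHub()
hub.deliver("slave1", "task finished")
hub.deliver("monitor", "create slave")
from_monitor = hub.read("monitor")      # a specific channel
first = hub.read()                      # whichever message arrived first
```

A master loop written against such an interface can stay entirely synchronous: it blocks in `read()` and reacts to whichever entity reports first.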



3.5.3 Data Communications 

Recall that EPIC is a shell around an existing software system. EPIC divides the
execution of that software into tasks. Each of these tasks communicates using disk files.
While the problems to which we are restricting ourselves do not use extremely large disk
files, experience has demonstrated that the performance improvements we reap through
parallelism are most severely limited by the speed with which we pass data between master
and slave. The message passing facility described above is not as fast as it could be, since
considerable effort is spent providing the functionality required by EPIC. Hence, if we
were to use the message passing facility for data communication, we would suffer from
suboptimal performance. In addition, the data contained in the input and output files may
be represented using any of the file record structures available in VAX/VMS. The message
passing facility is restricted to dealing with character strings. The standard VAX/VMS
interprocessor file copying commands provide the appropriate functionality at the fastest
possible speed.

To copy a file from one VAX/VMS system to another, an interactive user would 
type 

$ COPY node1"username1 password1"::device1:[directory1]file1.ext1 -
$_ node2"username2 password2"::device2:[directory2]file2.ext2

Naturally, if you were typing this on node1, you would omit the accounting
information for it. In general, VAX/VMS allows the inclusion of a node specification (with
accounting information) in any file specification. Opening a file with an account specifica- 
tion causes a process to be created on the remote node using the supplied username and 
password. That process efficiently handles the I/O calls made to the channel. The re- 
mote process creation is functionally transparent to the user, except for the time overhead 
involved. 

The way EPIC executes the VAX/VMS COPY command is by using the message
passing facility. The facility provides a call that creates a subprocess and keeps it around.
Sending a message to the subprocess causes the text of the message to be interpreted
as a VAX/VMS command. When the command finishes, a message is "sent" from the
subprocess to the main process. This way, the main process can be doing other things
while the subprocess is executing the command.

Each slave is responsible for bringing its task's input files from the master's
filesystem to its own, and for sending back the output files when a task is completed. Buffering
all the data files on the master is obviously less efficient than having each slave
transfer its task's input files directly from the slave that generated them. EPIC's approach
has as much as twice the file transfer overhead of the optimal approach. The reason
EPIC buffers all data files at the master is so that if a slave's processor crashes, its
work won't be lost. In the class of problems for which EPIC was designed, the cost of
re-executing one task may be greater than the total cost of all the file transfers for the
execution of every task.

EPIC spends some effort trying to minimize the number of file transfers. The
master keeps a database of all the files that reside on each slave's filesystem. Whenever
a task is assigned to a slave, the slave is told which of the input files it already has, so it
can suppress the COPY command. The effectiveness of this strategy is further enhanced
by modifying the task scheduling algorithm to take into account which input files for each
computable task are already resident on a free slave's filesystem. Specifically, the ready
queue is composed of a list of task groups. Each task in a given task group has the same
height, but varying sizes. The groups are arranged in decreasing order of height, and the
tasks within each group are sorted in decreasing order of size. When a slave becomes free,
the first task group is scanned to find the task that will require the fewest file transfers to
execute. Thus the task scheduling strategy is based on ordering the ready queue by three
different characteristics of each task:

1. The height of the task's subgraph (computed once) 

2. The number of input files that the slave already has (computed on the fly)

3. The size of the task's subgraph (computed once)
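
The three-key selection can be sketched as follows. The dictionary layout of a task, the function name, and the file names are assumptions made for this illustration. The sketch picks from the tallest task group the task needing the fewest file transfers, and breaks ties in favor of the larger subgraph.

```python
def pick_task(ready, files_on_slave):
    """Choose the next task for a slave that has just become free.

    ready          -- list of dicts with 'name', 'height', 'size' and
                      'inputs' (a set of input file names)
    files_on_slave -- set of file names already on the slave's filesystem
    """
    tallest = max(t["height"] for t in ready)
    group = [t for t in ready if t["height"] == tallest]    # first task group
    return min(group,
               key=lambda t: (len(t["inputs"] - files_on_slave), -t["size"]))

# Hypothetical ready queue.
ready = [
    {"name": "t1", "height": 5, "size": 9, "inputs": {"a.dat", "b.dat"}},
    {"name": "t2", "height": 5, "size": 7, "inputs": {"a.dat"}},
    {"name": "t3", "height": 4, "size": 20, "inputs": set()},
]
chosen = pick_task(ready, {"a.dat"})
```

Here `t3` is never considered, despite needing no transfers, because height takes precedence over both of the other keys.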

3.6 Fault Tolerance 

When the word "timesharing" is mentioned to someone who has recently survived 
an undergraduate Computer Science curriculum, the image that first enters his mind is 
that of an overloaded CPU. EPIC's dynamic task scheduling algorithm ensures that a
relatively heavily loaded processor will be assigned proportionally fewer tasks. Another
"timesharing" flashback is that of the downed computer. In those days, when the CPU was
down, it was of course no longer possible to get any useful work done (except maybe a trip
to the vending machine). With distributed computation, if one processor goes down, the
execution should gracefully continue with degraded performance. By outlining a typical
scenario, the need for this requirement will gain more substance. Assume EPIC is being
used to accelerate the DRC of a chip that might ordinarily take several days on a single
VAX 11/780 computer. Ten VAX computers are being used to (hopefully) finish the DRC
overnight. If one of them crashes (or is brought down for preventive maintenance), EPIC
ought to continue the computation at 90% of its former speed. If EPIC gives up its
unmanned computation, the layout designer may fall behind a whole day, assuming the
ten VAX computers will be far too loaded for long non-interactive jobs during working
hours.

Giving EPIC the capability to handle crashed slaves is fairly straightforward. The
scheduler doesn't statically prepartition the set of tasks; it just assigns priorities to them
so they can be easily assigned to slaves on the fly. If the message passing facility detects
that a slave has crashed while running a task, that task is placed back in the ready queue
according to its priorities. If the slave completed any tasks before crashing, the output
files are buffered in the master's file space, so the work won't have to be redone.
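
Returning a crashed slave's task to the ready queue can be sketched as follows. The heap-based ready queue, the function name, and the slave and task names are assumptions made for this illustration.

```python
import heapq

def requeue_on_crash(slave, running, ready, priority):
    """Put the task a crashed slave was running back on the ready queue.

    running  -- dict mapping slave name to the task it is executing
    ready    -- heap of (-height, -size, task), so the most urgent task
                is always at the front
    priority -- dict mapping task name to (height, size)
    """
    task = running.pop(slave, None)
    if task is not None:            # an idle slave has no task to recover
        height, size = priority[task]
        heapq.heappush(ready, (-height, -size, task))
    return task

# Hypothetical state at the moment slave "vax3" crashes.
priority = {"drc_width": (4, 10), "drc_space": (2, 3)}
ready = [(-2, -3, "drc_space")]
running = {"vax3": "drc_width"}
recovered = requeue_on_crash("vax3", running, ready, priority)
```

Since the priorities were computed before the run began, the recovered task goes back into the queue at the same rank it would have had if it had never been assigned.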

At any time during the course of a parallel computation, the user can go into 
MONITOR and create another slave. Again the dynamic task scheduling algorithm makes 
it easy. The new slave is added to the master's slave database, and (recall Figure 3.5) is 
immediately assigned a new task. Thus if the user is watching when a slave crashes, then 
when the machine is brought back up, the user can restart the slave process. 

A predecessor to EPIC called PDRC (Parallel Design Rule Checker) experimented
with a mechanism to periodically probe a crashed slave's processor to see if it had come back
up [Marantz 1984]. When the processor responded, PDRC would automatically regenerate
the slave. This worked well most of the time, but became very frustrating while debugging.
If a slave was misbehaving for any reason, terminating the process would be futile, since
PDRC would immediately sense that the processor was still up, and would create the slave
again. Nevertheless, this functionality should eventually be brought into EPIC.

Currently, EPIC is not capable of continuing a computation if the master's
processor crashes. It is, however, capable of restarting the parallel execution where it left
off. After the master first reads the execution control file, it goes through a process of
eliminating tasks in a manner very similar to that of Unix 5 Makefiles (and VAX/VMS
MMS 6 ). The algorithm used is presented in Figure 3.10.

check_tasks := task_roots
while "check_tasks" is not empty
    T1 := first element of check_tasks
    remove T1 from check_tasks
    if all of T1's output files exist then
        if all of T1's output files were last revised
           after each of T1's input files then
            call task_finished(T1)
        end if
    end if
end while

task_finished(T1):
    for each successor "T2" of task T1 do
        T2.predecessors_completed := 1 + T2.predecessors_completed
        if T2.predecessors_completed = T2.predecessors then
            append T2 to check_tasks
        end if
    end for
end task_finished

Figure 3.10: Algorithm for determining which tasks have already been done

For most applications, it would be sufficient to merely check for the existence of a
task's output files in order to mark it as complete. But since it was not hard to compare
the revision dates of the input and output files, and since doing so gives EPIC the basic
functionality of make, it was implemented. Thus giving EPIC the functionality of make
was as easy as converting the syntax of the Makefile to that of the execution control file.
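
The restart check can be sketched in Python using file modification times. The function names and the dictionary-based task representation are assumptions made for this illustration, and modification times stand in for VAX/VMS revision dates.

```python
import os

def is_up_to_date(inputs, outputs):
    """True if every file exists and every output was last modified
    after every input."""
    paths = list(inputs) + list(outputs)
    if not all(os.path.exists(p) for p in paths):
        return False
    newest_input = max((os.path.getmtime(p) for p in inputs),
                       default=float("-inf"))
    return all(os.path.getmtime(p) > newest_input for p in outputs)

def find_finished(roots, successors, predecessor_count, inputs, outputs):
    """Return the tasks that need not be run again.

    A task is only checked once all of its predecessors have been found
    to be finished, starting from the roots.
    """
    finished, completed = [], {}
    check = list(roots)
    while check:
        task = check.pop(0)
        if not is_up_to_date(inputs[task], outputs[task]):
            continue                # this task and its subgraph must be run
        finished.append(task)
        for succ in successors.get(task, ()):
            completed[succ] = completed.get(succ, 0) + 1
            if completed[succ] == predecessor_count[succ]:
                check.append(succ)
    return finished
```

A task whose outputs are missing or stale is left in the graph, and so are all the tasks that depend on it, because they are never appended to the list of tasks to check.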

5 Unix is a trademark of AT&T Bell Laboratories

3.7 Error Recovery

EPIC tries to address the problem of how to proceed when a slave's subprocess 
fails to properly execute the task it is given. A failure of this nature is detected in one 
of two ways. The message passing system will return the VAX/VMS error code if a 
problem was detected by the program run in the subprocess. If the program is not a 
VAX/VMS layered product, the error code may not say very much, but hopefully even an 
independently written program will abort by signalling an error rather than terminating 
normally. ECAD DRC, for example, behaves in this manner while remaining portable to 
other operating systems by dividing by zero whenever a problem is detected. The other 
way an error is detected is by checking for the absence of any of the task's output files 
when the task's DCL commands are finished. 

In the past, the cause of an unsuccessful task execution has stemmed from a variety 
of sources. Sometimes the error is a reflection of the state of the computational environment 
of the slave's node. Specifically, a library file or executable image could be missing from 
a system directory. Sometimes the error is due to a possibly transient condition on the 
slave's node, such as the lack of a resource needed to execute the task. Often, when one 
slave failed to execute a task, another was found to be capable of completing it. 

The strategy implemented by EPIC is to put a failed task back on the ready queue, 
and keep track of how many times it has failed. When this number reaches a certain 
threshold, currently defined to be 3, the task is deemed uncomputable, and is removed 
from the data dependency graph, along with all the tasks in its subgraph. 

For certain potential applications of EPIC, the cause of failure for any task is 
more likely to be illegal or erroneous input files. This is most likely the case when 
the application is to compile and link software. If EPIC detects a failure in a source code 
compilation, it is a waste of time to try it again three times before deeming it uncomputable. 
The right solution is then to reduce the task failure threshold to 1. The first time a task 
fails, it will be removed from the data dependency graph, and the rest of the tasks will be 
executed normally. 
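
The retry policy described above can be sketched as follows. The class and method names are hypothetical and the graph is reduced to a successor table; only the threshold behaviour (requeue on failure, remove the task and its subgraph once the limit is reached) is taken from the text.

```python
from collections import deque


def subgraph(successors, task):
    """Return the task together with every task reachable from it."""
    seen = {task}
    stack = [task]
    while stack:
        for nxt in successors.get(stack.pop(), []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


class FailureTracker:
    """Hypothetical bookkeeping for failed tasks (names are illustrative)."""

    def __init__(self, successors, threshold=3):
        self.successors = successors
        self.threshold = threshold      # 3 by default; 1 suits compilations
        self.failures = {}
        self.ready = deque()            # stand-in for the ready queue
        self.removed = set()            # tasks deemed uncomputable

    def task_failed(self, task):
        self.failures[task] = self.failures.get(task, 0) + 1
        if self.failures[task] >= self.threshold:
            self.removed |= subgraph(self.successors, task)
        else:
            self.ready.append(task)
```

With threshold=1, a failed compilation immediately removes the link step that depends on it, while unrelated tasks are left in the graph.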

Each slave also gets a counter, which is incremented whenever it fails its task and 
decremented whenever it completes its task. If this counter crosses a threshold, currently 
defined as 2, then EPIC destroys the slave on the grounds that it is a waste of time to 
assign tasks to it if it is going to fail more tasks than it completes. 
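
A minimal sketch of the per-slave counter follows. The class name is hypothetical, and the sketch assumes that "crosses" means the counter reaches the threshold and that the counter may go negative; the text does not settle either point.

```python
class SlaveHealth:
    """Hypothetical per-slave failure counter."""

    def __init__(self, threshold=2):
        self.threshold = threshold
        self.counter = 0
        self.alive = True

    def record(self, succeeded):
        # Completed tasks decrement the counter, failed tasks increment it.
        self.counter += -1 if succeeded else 1
        if self.counter >= self.threshold:   # assumption: reaching counts as crossing
            self.alive = False
        return self.alive
```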

This computer resource management strategy is analogous to human resource man- 
agement. A manager will assign the most responsibility to his most productive employees. 
EPIC's strategy could be extended to use more resolution in an attempt to imitate human 
managers. Currently, each slave is essentially treated as an equal. Slaves are picked from 
the "idle slaves" list to execute the highest priority task. If there is more than one slave 
in this list, then the slave that has cached the greatest percentage of the highest priority 
task's input files gets the job. It would be interesting to implement a scheme where the 
slaves were ordered according to their past productivity. When selecting a slave, weights 
would be placed on the number of files it already has, the number of tasks it has completed 
so far, and the number of tasks it has failed so far. 
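
The selection rule can be sketched as a scoring function. With the default weights it reproduces the rule described above (greatest fraction of cached input files wins); the other two weights illustrate the proposed extension. All names and the linear form of the score are assumptions made for the example.

```python
def cached_fraction(slave_files, task_inputs):
    """Fraction of the task's input files already held by the slave."""
    if not task_inputs:
        return 1.0
    return sum(f in slave_files for f in task_inputs) / len(task_inputs)


def pick_slave(idle_slaves, task_inputs, weights=(1.0, 0.0, 0.0)):
    """Choose an idle slave for the highest priority task.

    idle_slaves is a list of dicts with keys "name", "files" (a set),
    "completed" and "failed". The weights apply to cached files, tasks
    completed, and tasks failed, in that order (hypothetical scheme).
    """
    w_files, w_done, w_failed = weights

    def score(slave):
        return (w_files * cached_fraction(slave["files"], task_inputs)
                + w_done * slave["completed"]
                - w_failed * slave["failed"])

    return max(idle_slaves, key=score)["name"]
```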

3.8 VAXcluster Support 

A VAXcluster is a group of up to sixteen VAX computers connected to a single 
file system. Thus the file system looks exactly the same when you are logged into any 
VAXcluster member. EPIC supports the use of VAXclusters. By issuing a command to 
MONITOR, a user can specify a list of node names to define a VAXcluster. A database is 
maintained to keep track of where all the relevant data files are in the network. The 
structure of the database reflects the file sharing between VAXclustered nodes, and provides 
for any number of discrete VAXclusters: 

network-database = list of VAXcluster-databases 
VAXcluster-database = list of file-specifications 

The file-specification in the VAXcluster-database cannot include a "node::" 
specification, but can include a device or directory. Computers that are not VAXcluster 
members are represented in the database as single-node VAXclusters. Thus an arbitrary 
environment of VAXclustered and independent nodes is supported. 

Each slave entry in the master's database contains a pointer to the slave's node's 
VAXcluster. So if the master and a slave are on the same VAXcluster and are connected 
to the same device and directory, the master will know that the slave will never have to 
copy an input or output file. If they are on the same VAXcluster but connected to different 
devices or directories, the master will know to instruct the slave to use a local file transfer, 
and thereby save the overhead of creating the foreign process and moving the file over the 
Ethernet. If two slaves on the same VAXcluster share the same device and directory, the 
master will understand that they share the file space, and that one slave will never have 
to copy a file that was created or copied by the other. As of now, no advantage will be 
gained from two slaves on the VAXcluster with different devices or directories, unless the 
master is also on their VAXcluster. 
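
The three cases above reduce to a small decision function. The Location type, its field names, and the returned labels are hypothetical; a standalone node is modelled as a single-node VAXcluster, as in the database description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    cluster: str     # VAXcluster name (a standalone node is its own cluster)
    directory: str   # device and directory, e.g. "DISK1:[EPIC]"


def transfer_mode(master, slave):
    """Decide how a file moves between the master and a slave."""
    if master.cluster != slave.cluster:
        return "network"   # copy over the Ethernet
    if master.directory == slave.directory:
        return "none"      # shared file space, nothing to copy
    return "local"         # same cluster, different device or directory
```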

Thus it is highly advantageous to have each slave on a VAXcluster running out 
of the same directory. If the master also uses that directory, then there will be no data 
transfer overhead for those slaves. This eliminates the single most significant bottleneck 
in the parallel execution. 

The only legitimate motivation for running VAXclustered slaves out of different 
directories is if the application software has naming conflicts with temporary files it uses. 
Two processes running the same application program may both be trying to read and write 
a temporary file of the same name. By running the two processes out of different default 
directories, the naming problem is resolved, and EPIC will still run, albeit with more data 
transfer overhead. Another motivation is as a workaround to a bug that may exist in the 
application software. If a single input file is used by two tasks, and both those tasks are 
executed at the same time by different CPUs in the same filespace, then the second process 
to open the file is subject to a file locking error. In VAX/VMS, any number of processes 
can open a file for read access. But if one process opens a file for read/write access, any 
other process attempting to access that file will get a "file locked" error. The problem 
occurs when a program that is only interested in reading the file erroneously opens it for 
read/write access. 

3.9 Performance Monitoring 

In order to support the claims made about the effectiveness of the task scheduling 
and file transfer optimizations, it was necessary to generate statistics for each EPIC run 
concerning the breakdown of where each slave's time was spent. For the purposes of 
performance monitoring, each slave is always in one of four states, as described below: 

EXEC: Executing a task. 

FILE: Transferring an input or output file. 

IDLE: Waiting for a task to become computable, but not FREE. 

FREE: 1. The execution is in its first stages, and the data dependency graph hasn't 
widened enough to allow all slaves to begin doing useful work. 

2. The execution is in its last stages, and there are no more tasks left to execute. 
The execution will be finished as soon as the last slave that is executing now 
finishes its current task. Free slaves are not killed because if an executing 
slave's processor crashes, a free slave should be available to take over the 
task. 

The distinction between "FREE" and "IDLE" is motivated out of fairness to the 
task scheduling algorithm. We are interested in identifying those times when a slave 
remains idle due to an unwise task scheduling decision. Typically, data dependency graphs 
have a small number of roots, but widen out quite a bit to reveal more parallelism. There is 
nothing a task scheduling algorithm can do to keep all the slaves busy during the execution 
of the roots. Additionally, at the end of the computation, it is impossible to keep each 
slave busy if there are no more tasks to execute. Thus the slave is classified as "FREE" if 
the cause of its inactivity is not a scheduling decision. "IDLE" time is what we want to 
keep track of to judge the task scheduling performance. 

Each slave is responsible for keeping track of its own performance statistics. A 
performance monitoring subroutine package was built using VAX/VMS system services 
for keeping track of the various counters for CPU time and elapsed time. The slave uses 
the message passing facility to spawn a subprocess to do the file transfers and execute 
the VAX/VMS commands used to execute each task. Thus the SLAVE program runs in a 
separate process from the slave's task, and is free to spend whatever time it needs to keep 
track of the subprocess. 

Periodically, the slave sends the master a one line summary of its progress. The 
master then relays this information to the monitor, which displays the information on the 
user's screen. The user can control the period at which each slave sends the information 
by issuing a command to the monitor. 

At the end of the computation, each slave sends a detailed summary of its statistics, 
including: 

• The total CPU time and elapsed time it spent in each of the four states. 

• The number of tasks it executed. 

• The number of tasks it failed to execute. 

• The number of files it transferred. 

• The number of files it avoided transferring due to file transfer optimization. 

• VAX/VMS statistics such as virtual memory usage and page faults. 
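
The per-state elapsed time bookkeeping can be sketched as follows. This is an illustration only: the real package used VAX/VMS system services and also tracked CPU time, whereas this sketch records elapsed time from an injectable clock. The class name, method names, and the choice of FREE as the starting state are assumptions.

```python
import time

STATES = ("EXEC", "FILE", "IDLE", "FREE")


class StateTimer:
    """Accumulate elapsed time in each of the four slave states."""

    def __init__(self, clock=time.monotonic, initial="FREE"):
        self.clock = clock
        self.elapsed = {s: 0.0 for s in STATES}
        self.state = initial
        self.since = clock()

    def switch(self, new_state):
        if new_state not in self.elapsed:
            raise ValueError(f"unknown state: {new_state}")
        now = self.clock()
        self.elapsed[self.state] += now - self.since   # close the old interval
        self.state, self.since = new_state, now

    def percentages(self):
        total = sum(self.elapsed.values())
        return {s: (100.0 * t / total if total else 0.0)
                for s, t in self.elapsed.items()}
```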

The master takes each summary that the slave provides and formats it into a table. 
In addition, the master makes its own contribution to performance monitoring. Whenever 
a task is started or finished, the master notes the current time and the name of the task's 
slave. At the end of the run, it generates a graphical journal of how the run progressed. 
The graph is organized by assigning a vertical column to each slave. Each column contains 
a series of diamonds which represent the tasks executed by each machine. The height of 
each diamond is directly proportional to the time it took to execute the corresponding 
task. Arcs are drawn between diamonds wherever a data dependency exists between the 
diamond's tasks. The left edge of the graph is scored with labels indicating the elapsed 
time at that vertical point on the page. 

There are two useful pieces of data to be gleaned from that graph. It gives us an 
intuitive feel for how the execution was distributed among the available processors. In 
addition, vertical space between the diamonds in any column indicates that that column's 
slave was either idle or transferring files during that time. The slope of the arcs ending 
at the lower diamond gives us intuition about the reason for the space in between the 
diamonds. A nearly horizontal line indicates that the slave was sitting idle waiting for 
a task to become computable. A line with a greater slope indicates that the slave was 
waiting for the input files to the task to be shipped over the network. 

Appendix C contains examples of summary tables and graphs for several runs of 
EPIC. 



3.10 Results 

No conclusions can be drawn about the overall performance of EPIC without 
reference to a specific application. The following chapter discusses the application of EPIC 
to VLSI design rule checking, circuit extraction, and Makefiles. 

3.11 Future Extensions 

In this section, several extensions to EPIC are contemplated. A fairly 
straightforward extension is to delete intermediate files as soon as they are not needed. This is 
not difficult to implement, except when it is combined with another straightforward 
extension, which is to avoid buffering intermediate files at the master. The buffering 
provides a redundancy that is needed to avoid repeating work that is lost due to a crashed 
slave. If both these extensions are implemented, and if a slave crashes, we may find that 
we have "burned our bridges behind us": the files needed to redo the slave's work may not 
exist anymore, possibly forcing us to pop back to the roots of the data dependency graph 
and effectively start over. The motivation for these extensions is discussed in the following 
chapter. 

Another extension is to bring more intelligence into the choice of which slave to 
assign to the highest priority task. Most of the time, there are plenty of tasks to execute, 
and the master is waiting for a slave to finish its current task. But data dependency graphs 
that have narrow sections, such as the initial separation stage of a "divide and conquer" 
application, may be run more efficiently if the most powerful computer is used for the 
bottleneck task. 

One flashy feature that would be relatively easy to add is the ability to revive old 
slaves whose processors crashed and were then brought back up. As mentioned before, 
EPIC's predecessor, PDRC, had this capability. 

A more substantial extension addresses the problem of continuing the computation 
even if the master's processor crashes. It involves the use of shadows: A shadow runs on 
a different processor from the master, though it could share a processor with a slave. It 
maintains a database of slaves and tasks. Using the message passing facility, it monitors 
events as they happen on the master and updates its database accordingly. If the message 
passing facility detects that the master has crashed, the shadow contacts the slaves and 
takes over control of the computation, thus becoming the new master. If the master and 
shadow are VAXclustered together, then the transition is conceptually straightforward, 
since the master's buffered files are still accessible. If they do not share a VAXcluster, 
then the shadow must actively copy the master's buffered files as they are created. 

Shadows were not implemented in EPIC due to lack of time. However, it is unclear 
whether they would actually be used in practice if they existed. They help make EPIC 
fault-tolerant by adding redundancy, but in the case of VAX computers that are not 
VAXcluster members, they do this at a considerable cost of disk space. 

Another substantial extension attempts to reduce the penalty of data communica- 
tion. The concept is analogous to that of instruction prefetch. Based on the observation 
that network file transfers are more I/O bound than compute bound, EPIC would attempt 
to predict what task a slave would execute before the slave finished its current task. The 
slave would then retrieve the next task's input files in a separate process. Presumably, the 
slave's execution process and file transfer processes would not detrimentally compete for 
cycles within the slave's processor, because they use different resources. 

Another related technique is delayed reporting. Currently, when a slave completes 
the execution of a task, it immediately proceeds to transfer the output files back to the 
master. Only when the transfer is complete does the slave notify the master that it is ready 
to execute another task. By notifying the master as soon as it is finished with the execution 
of its current task, the slave can be assigned a new task while it is still transferring the 
old output files. This approach is most effective if the slave already has the files it needs 
to execute the next task. Hence it is an ideal companion to data prefetch. 

Data prefetch is difficult to implement because it involves predicting the best task 
to give to a machine when the execution is in some future state. The use of this technique 
would most likely require altering the task scheduling strategy. While these enhancements 
are interesting topics for future research, the potential gains will diminish as VAXclusters 
become a more popular vehicle for coarse multiprocessing. 

Chapter 4 
Applications 



This chapter discusses several applications of EPIC. Methodologies are presented 
for automatically generating an execution control file for each application. Results are given 
for various cases of each application run on several different multiprocessor configurations. 

A comparison is made between EPIC with ECAD's DRACULA serial DRC program 
and ECAD's Parallel DRACULA. 

4.1 Design Rule Checking 

The challenge of adapting DRACULA to be distributed over a network of VAX 
computers using EPIC lies in generating the execution control file from the ECAD rules file. 
In order to do this, we have to understand the mechanics of how DRACULA is normally run 
on a single VAX computer. The VLSI process engineer defines the geometric design rules. 
The VLSI layout designer lays out the chip according to the design rules, thus generating 
a file in some standard layout description language, such as CIF [Mead, Conway 1980] or 
GDSII¹. A programmer must then specify the process engineer's design rules in the 
language defined for that purpose by ECAD. These rules are fed to ECAD's preprocessor, 
PDRACULA, which generates the VAX/VMS command file which runs all the VAX/VMS 
executables that implement the statements in the rules file, hence running the DRC. 
Typically, the command file is submitted as a batch job. 

¹ GDSII is a trademark of G.E. Calma Corporation 

To maximize efficiency, ECAD rearranges the statements in the rules program. If 
any individual DRC program is called by more than one rules statement, then ECAD's 
preprocessor tries to execute those statements together with one call to the program (while 
obeying data dependency constraints) and thereby minimize image activations. Depending 
on the value of a switch set in the rules file, the preprocessor may attempt to rearrange the 
order of execution of the rules program statements and delete temporary files to minimize 
peak disk space usage. 

Unfortunately, all these optimizations deplete the extent of the parallelism by 
introducing new data dependencies. By deleting intermediate disk files after they are used, 
the preprocessor introduces a new constraint restricting the order of the execution of the 
rules statements. But the philosophy behind EPIC is to use whatever hardware you have 
available to solve a specific problem as quickly as you can. We are willing to sacrifice disk 
space in order to achieve maximal speed. It is worth noting that EPIC may not be able 
to DRC large chips if there is just enough disk space to do a serial run using ECAD's 
optimized file deletions. 

As mentioned in the previous chapter, it would not be hard to modify EPIC to 
optionally delete intermediate files once they are not needed. This would bring EPIC's 
peak disk space usage down considerably. But since EPIC schedules so as to minimize 
execution time, rather than disk space, it still wouldn't be as stingy as an optimized serial 
DRC. To further close the gap, EPIC could be modified to avoid storing every intermediate 
file on the master's filesystem. Instead, a slave would copy its task's input files directly 
from the slave that produced them (or from the master if the task is a root node in the 
data dependency graph). Rather than having the slave copy its task's output files back to 
the master, the master would just note where the file resides. EPIC could then copy the 
final output files (such as the DRC error summary and layout files) back to the master's 
filesystem. As mentioned in the previous chapter, this would cut down the file transfer 
overhead by as much as a factor of two. The disadvantage is that a crashed slave's previous 
work would have to be redone. 

A more practical consideration about the preprocessor is that its rearrangements 
of the command file make it mechanically difficult to identify the VAX/VMS commands 

Fragment from a DRC rules program: 

    AND POLY DIFF GATE                    ; Figure out the gate area 
    WIDTH GATE LT 4.0 OUTPUT GWID 32 32   ; Gate width >= 4u 

Corresponding execution control file fragment: 

    task "AND POLY DIFF GATE ; Figure out the gate area"- 
        /INPUT=(POLY.DAT,- 
                DIFF.DAT)- 
        /OUTPUT=(GATE.DAT)- 
        /DCL=("$@SYS$LOGIN:MOSIS.COM 32") 

    task "WIDTH GATE LT 4.0 OUTPUT GWID 32 32 ; Gate width >= 4u"- 
        /INPUT=(GATE.DAT)- 
        /OUTPUT=(GWID32.DAT)- 
        /DCL=("$@SYS$LOGIN:MOSIS.COM 33") 

Corresponding execution command file fragment: 

    $GOTO 'P1'   !Jump to the task number specified as first parameter 
    $! 
    $32:   !AND POLY DIFF GATE ; Figure out the gate area 
    $RUN SEGCADIECAD: LOGICAL 
    2 POLY DIFF GATE 1000 MIC 
    $EXIT 
    $33:   !WIDTH GATE LT 4.0 OUTPUT GWID 32 32 ; Gate width >= 4u 
    $RUN SEGCADIECAD: SPACING 
    1 GATE GATE 0.000 4.000 MIC 1000 OS 
    00000000 00 
    NOT-CONJUNCTED 
    1 GWID32 GWID32 32 32 100 
    $EXIT 



Figure 4.1: MOSIS CMOS DRC rules fragment, ECF fragment, and COM fragment 

needed to execute any particular task in the rules file. For this reason, directly decomposing 
the preprocessor's command file was not a successful strategy. 

A better approach is to decompose the rules program and run the preprocessor 
separately on each statement. Every command file generated by the preprocessor is parsed 
to remove the extraneous initialization and error merging code. The remaining text from 
each command file is used to construct a single command file. To execute a single task in 
the execution control file, this command file is invoked so as to execute the correct segment 
of code. A preprocessor was written to automatically convert a DRC rules program into 
an execution control file and an execution command file. It is called ECAD2ECF. Figure 4.1 
shows the output of ECAD2ECF for a fragment of a rules program written to design rule 
check VLSI designs laid out using the 4µ MOSIS CMOS process [Mead, Conway 1980]. 
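
As an illustration of the kind of text ECAD2ECF emits, the sketch below renders one execution control file entry in the style of Figure 4.1. It is not ECAD2ECF itself: the function name and arguments are hypothetical, the qualifier syntax is assumed to use "=", and the ".DAT" suffix and one-line file lists are simplifications.

```python
def ecf_task(statement, inputs, outputs, command_file, label):
    """Render one task entry modelled loosely on Figure 4.1.

    statement    -- the rules statement, used as the task name
    inputs       -- layer names read by the statement
    outputs      -- layer names written by the statement
    command_file -- the single execution command file (hypothetical path)
    label        -- the task number passed to that file as its first parameter
    """
    def files(layers):
        return ",".join(f"{layer}.DAT" for layer in layers)

    return "\n".join([
        f'task "{statement}"-',
        f"    /INPUT=({files(inputs)})-",
        f"    /OUTPUT=({files(outputs)})-",
        f'    /DCL=("$@{command_file} {label}")',
    ])


print(ecf_task("AND POLY DIFF GATE", ["POLY", "DIFF"], ["GATE"],
               "SYS$LOGIN:MOSIS.COM", 32))
```

The label argument corresponds to the jump target in the shared command file, so each task invokes the same file with a different first parameter.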

Two stages of the DRC are not covered by the tasks described in ECAD2ECF's 
execution control file. It is not clear whether the initial separation of each layer from the 
layout file is an inherently parallel operation. This operation is most likely implemented 
by examining the whole layout in one pass, appending to a given layer file whenever it 
encounters geometry for that layer. One thing that is clear about this initial stage is that the 
input file is large, since it contains the geometry for every layer. It would not be efficient 
for a slave to move this file across the network, perform the initial separation, and copy 
all the layer files back to the master. Instead, this stage is executed by the master, using 
a subprocess. Further preparation of each layer is described in the execution control file, 
and executed normally by the slaves. This preparation includes the full instantiation of 
the geometry in the layer, a polygon sorting step, and the merging together of overlapping 
polygons. 

Similarly, the final stage of the DRC is executed by the master's subprocess. This 
stage involves compiling the information generated by the execution of each rule into a 
summary file and an error layout file. Conceptually, this step could be done in parallel 
by merging together the individual error files in a binary tree. If each error file has to 
be shipped over the network to a slave, this would probably not save any time. Using 
a VAXcluster, there is more of a potential gain. Unfortunately, there is no way to do a 
multi-stage merge using the DRACULA programs. The input files for the summary programs 
are data files with an unknown record structure, and the summary files can't be converted 
back to the input format. 

4.1.1 Predictions 

According to the data dependency graph (Appendix B) for Digital's CMOS process 
rules, the maximum extent of parallelism is very high. After the execution of the tasks 
in the top row of the graph, which do the initial preparation of each VLSI layer, and 
the execution of the tasks in the second row of the graph, which mask out the geometry 
that is not to be checked, there are many tasks whose outputs are not used as inputs by 
any other tasks. Those correspond to simple DRC rules such as single-layer width and 
spacing checks. We call them "terminal tasks". EPIC's task scheduler does very well in 
the presence of a large number of terminal tasks. They are computable early on in the 
computation, but their execution can be delayed until a processor has nothing else to do. 
They help "fill in the gaps" of processor idleness. 

Figure 4.2: Optimistic analysis of DEC CMOS rules based on data dependency 

There are 125 tasks in the CMOS data dependency graph. Assuming that each 
task executes in one tick of time, a serial DRC will run in 125 ticks. If there are no 
communication costs, then with two processors, the job can be run in 63 ticks. As the 
number of processors grows, the data dependency will begin to constrain the maximum 
speedup we can hope to achieve. This is illustrated in the graphs on the corner of each 
page of the thesis (see the Preface), and in Figure 4.2. 
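
The arithmetic behind these figures can be sketched as follows. Under unit task times and zero communication cost, no schedule can beat the larger of ceil(N/p) and the length of the longest dependency chain; the first function computes that bound and the second simulates a simple level-by-level schedule on a small made-up graph. Neither is EPIC's scheduler (the simulation picks ready tasks alphabetically rather than by priority), and the example graph is hypothetical, not the 125-task graph of Appendix B.

```python
import math


def lower_bound_ticks(num_tasks, processors, critical_path=1):
    """Best possible run time in ticks for unit-time tasks."""
    return max(math.ceil(num_tasks / processors), critical_path)


def simulate(preds, processors):
    """Run unit-time tasks tick by tick; preds maps task -> set of predecessors."""
    remaining = {t: set(p) for t, p in preds.items()}
    done = set()
    ticks = 0
    while remaining:
        ready = sorted(t for t, p in remaining.items() if p <= done)
        if not ready:
            raise ValueError("dependency cycle")
        batch = ready[:processors]        # at most one task per processor
        for t in batch:
            del remaining[t]
        done.update(batch)
        ticks += 1
    return ticks


print(lower_bound_ticks(125, 1))   # 125 ticks for a serial run
print(lower_bound_ticks(125, 2))   # 63 ticks with two processors

# Hypothetical five-task graph: A feeds B, C, D; E needs B and C.
graph = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"A"}, "E": {"B", "C"}}
print(simulate(graph, 2))          # 3 ticks
```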

The most striking feature of this chart is that it indicates that up to fourteen ma- 
chines can be almost fully utilized in a parallel DRC. The analysis neglects communications 
overhead, but that is not why it is overly optimistic. The fault lies in the assumption 
that each task takes unit time. Depending on the VLSI layout, the checking of rules that 
deal with active area or polysilicon might require the examination of more complex 
geometrical structures than the checking of rules that deal with well area or diffusion implant. 
EPIC's task scheduling algorithm is equipped to deal with nonuniform task execution 
estimates, but ECAD2ECF does not provide the estimations. It would be interesting to sta- 
tistically determine good estimates for the execution time of each task. Unfortunately, 
time did not permit this. 

Figure 4.3: Optimistic analysis of MOSIS CMOS rules based on data dependency 

The MOSIS CMOS design rule set is much simpler than DEC's, and hence is 
implemented in fewer rules file statements. Thus there is not as much potential for parallelism. 
This is balanced by the fact that for a chip of any given complexity, it is far easier to check 
the MOSIS rules than the DEC rules. The analysis of the MOSIS rules is in Figure 4.3. 

4.1.2 Testing 

Obtaining consistent results for EPIC/DRACULA has been difficult. We are more 
interested in the elapsed time of a DRC run than we are in the cumulative CPU time. Since 
the "multiprocessor" used for the test runs is just a set of timesharing VAX computers 
which are all connected to Digital's local Ethernet, the response time of both the network 
and the system has been unpredictable. Even late at night, many of the systems are loaded 
with batch jobs and high priority file system backups. 

Several steps were taken toward minimizing external factors that could alter the 
elapsed time for a test. Exploratory test runs were conducted at various times during the 
day, indicating that the computers were most responsive very early in the morning. Each 
result presented here was taken from the best of several runs on a particular multiprocessor 
configuration. In addition, we tried to make the test results at least partially immune to 
the timesharing competition of other batch jobs by running at a higher priority. 

Nevertheless, the slaves typically received less than 80% of the CPU, as determined 
by the ratio of execution CPU time to elapsed execution time for every slave. Various 
factors contribute to this that may or may not be related to the parallel processing scheme. 
Page faults, for example, can be caused by timesharing competition for physical memory, 
which is unrelated to EPIC. On the other hand, page faults can also be caused by the 
increased number of image activations incurred due to the subdivision of the DRC job. The 
runs on DECnet suffer even more, because DRC program invocations are often interspersed 
with file transfer commands, possibly causing the DRC program pages to be swapped out. 
It should be noted that since the measured CPU percentage was generally greatest for 
the serial runs, the observed speed-up factors may be smaller than those that might be 
achieved using EPIC on a single-user multiprocessor. 

The number of processors available for testing was limited, since several of the 
group's computers were recently upgraded from VAX 11/780 to VAX 11/785 computers. 
From a software point of view, the upgrade is very transparent. The only noticeable change 
is the improved response time. But to make a meaningful statement about the speedup 
factor EPIC provides to DRACULA, we need to compare the elapsed time for a parallel run 
on a fixed number of identical processors to the elapsed time for a serial run on one of 
those processors. 

MicroVAX computers provide one possible alternative. They are starting to prolif- 
erate in quantity throughout the Hudson plant and it is possible to get exclusive access 
to them at night. So assuming they all have the same amount of physical memory, their 
performance should be fairly predictable. Unfortunately, most MicroVAX computers are 
configured with far too little disk space and paging file space to run a substantial DRC. 
Small DRCs aren't very informative, since the amount of time required to execute each 
task becomes small enough so that the communications overhead is substantial. Since 
EPIC is geared toward accelerating the verification of much larger chips, data gleaned 
from DRCs run on the available MicroVAX computers will be overly pessimistic. 

Sufficient resources were not available to fully test my predictions for the maximum 
extent of parallelism in DRC. A VAXcluster with six machines was available for testing 
during off hours, but it consisted of three VAX 11/780 computers, two VAX 11/785 
computers, and one VAX 8600 computer. In addition, three VAX 11/780 computers connected 
by Ethernet were available. Six MicroVAX II computers were also available, but were not 
generally capable of DRCing my benchmark. 

The results presented here consist of EPIC runs using up to three VAXclustered 
VAX 11/780 computers and up to six independent VAX 11/780 computers. The inde- 
pendent VAX 11/780 computer tests were accomplished by not informing EPIC that the 
three VAXclustered computers shared the same filesystem. File transfers were made with 
DECnet protocol, so the tests suffered the same overhead that would have been incurred 
if the computers had not been VAXclustered together. The elapsed time from these tests 
is compared to the elapsed time for a serial run on one VAX 11/780 computer. By 
testing how well EPIC performs using just one processor, we attempt to isolate the control 
communications overhead incurred due to EPIC. 

EPIC's raw elapsed times are measured from the time the MASTER program is 
invoked to the point after the run when the last slave is killed. We also give the average 
percentage of slave time dedicated to task execution, file transfer, and idle time. As 
discussed in Chapter 3, the idle time does not include the time at the beginning and end 
of each run when there is no work for the slaves to do. Finally, we give the ratio of the 
slaves' total execution CPU time to elapsed execution time, which provides a measure of 
how much our results suffered due to competion for the CPU. 

In addition to analyzing the raw elapsed times, we try to determine why the performance
didn't quite match the speed-ups predicted in Figure 4.2. Those optimistic figures
didn't take into account the time required to split the chip into its constituent layers or the
time required to merge the error reports back together. These times are subtracted from
the raw elapsed times and the analysis is repeated using the modified data. The remaining
non-linearities are small enough to be accounted for by EPIC's overhead, and by other
factors that are difficult to control, such as competition for the CPU, page faulting, and an
increased number of image activations.
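The two measures used in this analysis can be sketched as follows. This is only an illustration and not code from EPIC: the function names are hypothetical, and the sketch assumes that the same split and merge times are removed from both the serial and the parallel elapsed times.

```python
def speedup(serial_elapsed, parallel_elapsed, split_time=0.0, merge_time=0.0):
    """Speed-up of a parallel run over a serial run.

    split_time and merge_time are the serial steps (splitting the chip
    into layers, merging the error reports) that the optimistic
    prediction ignored. Assumption: they are subtracted from both runs.
    """
    overhead = split_time + merge_time
    return (serial_elapsed - overhead) / (parallel_elapsed - overhead)


def cpu_share(cpu_seconds, elapsed_seconds):
    """Ratio of execution CPU time to elapsed execution time.

    A value well below 1.0 suggests competition for the CPU.
    """
    return cpu_seconds / elapsed_seconds
```

With placeholder timings, `speedup(1000, 400)` gives the raw figure and `speedup(1000, 400, 50, 50)` gives the figure with the split and merge steps removed; the second is always the larger of the two when the parallel run is faster.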

According to the tests in Figure 4.4, EPIC offers a significant performance enhancement
over serial DRACULA. I was able to try ECAD's Parallel DRACULA on three
VAXclustered VAX 11/780 computers using the same benchmark. The tests indicated
a speed-up of 1.8 using three machines. This is less of a speed-up than was reported in
ECAD's article [Nielson 1986], which reported a speed-up of [illegible] for two machines. This
disparity may be due to excessive competition for the CPU, a factor that was difficult to
determine because the ECAD controller runs several jobs simultaneously on each processor.
On average, the ECAD jobs each got 43% of the CPU, but there were typically
two or three jobs on each processor at any given time, so it was difficult to determine how
much the DRC was slowed by timesharing overhead.

Figure 4.4: Results and analysis of DEC CMOS DRC tests

On the same benchmark, with the same hardware configuration, EPIC demonstrated
a speed-up of 2.6. This is not conclusive, however, and we suspect this data
doesn't tell the whole story for two reasons. First, ECAD's results were most likely based on
the DRC of a larger chip than the one used for this benchmark, which reduces the relative
overhead of submitting a new batch job for each task. Second, the competition for the CPU was
possibly an important issue, but it is difficult to determine the extent of its effect.

In addition to the difference in runtimes between the EPIC and ECAD benchmarks,
the tests indicate the relative versatility of EPIC's approach. Only three homogeneous
VAXclustered processors were available, but six homogeneous processors were available
through DECnet. Since EPIC is capable of using DECnet without VAXclusters, we were
able to perform tests using more processors. The relative availability of VAXclustered versus
independent computers at DEC may indicate that EPIC is more generally useful than
Parallel ECAD. This may become less important if VAXclusters become more prevalent
in the future.

There are several more interesting pieces of information that can be gleaned from
the data in Figure 4.4. First, we mention that since we ran the master and one slave on
a single processor, there was a nonlinearity in the DECnet tests. EPIC notices when the
master and a slave are running on the same processor and uses this knowledge to "short-circuit"
that slave's DECnet file transfers with a local $COPY command. The impact of
this short-circuiting can be seen by comparing SLAVE1's file transfer times with those of
any other slave, on all the charts of EPIC/DECnet tests in Appendix C.

For all the DECnet tests, the file transfer time rose with the number of processors. 
Not enough data is present to determine the relationship between the file transfer overhead 
and the number of processors (i.e. linear, polynomial, or exponential). 

A definite pattern was not observed for the idle time overhead, but it never
exceeded 1.5%. In the tests made here, no slave was ever idle for lack of work to do.
Idle time accumulated due to network message passing latencies. We would expect the 
absolute message passing time to remain unaffected by the number of processors, since 
the number of tasks remains constant. Naturally, since the elapsed time of the DRC 
shrinks as the number of processors grows, we would expect the relative overhead of the 
message passing latency to increase. But the dominant factor in message passing latency 
is probably network congestion, which varies greatly over time. As discussed in Chapter 
3, the VAXcluster runs use DECnet for control communication, so they are also affected. 

The tests run here indicate that the speed-up factor was beginning to fall off as the
number of processors increased to five or six. This is expected in the DECnet tests, since
the data communication overhead increases with the number of processors. It is likely
that we will not be able to use fourteen independent processors to achieve our goal of
completing a DRC as fast as the data dependency will allow. Nevertheless, the results are
good enough to justify the additional hardware expense in a production environment. On
the other hand, the three processor VAXcluster test results were sufficiently promising to
warrant additional experimentation. It would be intriguing to see how many VAXclustered
processors we can use before the speed-up factor begins to fall off.

Both the DECnet and VAXcluster results might have been more favorable if the test
case had used a larger chip. Since the execution time of the DRC tends to grow faster than the
size of the files, the data communication overhead would probably become less significant.
The control communication overhead would vanish quickly, since it grows with the size of
the design rule set, not the chip size.

So far, with up to six processors, EPIC's task scheduling strategy has been essentially
optimal. If, as we increase the number of processors, inefficient task scheduling
becomes a bottleneck, we will probably be able to improve the task scheduling by supplying
statistical estimations of the length of each task, based on previous runs.

Thus EPIC potentially offers the mechanism to run DRCs as fast as the critical
path through the data dependency graph will allow. To achieve this goal, we need to do
the following:

• Use more VAXclustered processors. 

• Obtain exclusive access to them, so the test results will be repeatable. 

• Develop statistical estimations for the execution time of each task, so task scheduling 
will (hopefully) not be a bottleneck. 
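The relationship between per-task estimates and the critical path can be sketched as follows. This is a minimal illustration under stated assumptions, not part of EPIC: the function names are hypothetical, the estimate is a plain average of previous runs, and the dependency graph is assumed to be acyclic.

```python
from statistics import mean


def estimate_durations(history):
    """Average the observed execution times of each task over previous runs.

    history maps a task name to a list of measured times.
    """
    return {task: mean(times) for task, times in history.items()}


def critical_path_length(deps, duration):
    """Length of the longest path through the data dependency graph.

    deps maps a task to the tasks whose output it needs; tasks without
    an entry have no dependencies. No schedule, on any number of
    processors, can finish sooner than this.
    """
    finish = {}

    def earliest_finish(task):
        if task not in finish:
            start = max((earliest_finish(d) for d in deps.get(task, ())),
                        default=0.0)
            finish[task] = start + duration[task]
        return finish[task]

    return max(earliest_finish(t) for t in duration)
```

The same per-task finish times could also serve as scheduling priorities (longest remaining path first), though whether that would help EPIC in practice is not something the tests here establish.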

The difficulties I encountered while running DRCs on MicroVAX computers do not
represent an unsolvable problem. By configuring them with enough physical memory and
disk space, a group of MicroVAX II computers connected by a dedicated Ethernet would
work well as a low-cost, high-performance DRC server. If ten MicroVAX II computers
can offer a 7x speedup for DRC (the optimistic analysis indicated 9.6), then they offer a
faster turnaround time than one VAX 8600 computer (which runs roughly 5 times as fast
as the MicroVAX II computer), for roughly the same monetary cost.

4.2 Circuit Extraction

Digital's circuit extractor [Tarolli, Herman 1983] has been adapted for parallel
execution using EPIC in a system called MACE (a Multiprocessing Approach to Circuit
Extraction) [Levitin 1986]. MACE attempts to take advantage of the geometric locality of VLSI
by dividing the layout into swaths (strips) which are processed in parallel. Unfortunately,
this is a much more difficult task than it is for design rule checking [Bier, Pleszkun 1985].
It is not clear how to correctly handle the case when a swath's border crosses a transistor.
However, by carefully choosing the swath boundaries, it is sometimes possible to avoid this
case. For a chip of sufficient size, it may not be possible to draw a straight line across it
without hitting a transistor. For this reason, MACE has only been tested with relatively
small cells. As stated before, EPIC is geared for larger scale problems so that the overhead
of control communication becomes negligible.

The results as of this writing have not indicated a significant speed-up. The layouts 
were partitioned into two swaths. The extraction was performed separately on each swath 
using two slaves, and the two resulting circuits were merged together afterwards. In 
practice, the speed gained through parallelism in the extraction phase was overwhelmed 
by the cost of merging the circuits together. The serial extraction actually took less elapsed
time than the parallel extraction and merge [Levitin 1986].

4.3 Compiling and Linking Programs 

The automatic translation of makefiles to execution control files is fairly
straightforward. Writing MAKE2ECF was simply a matter of changing the syntax of each task
description.
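The kind of translation involved can be sketched as follows. This is not the thesis's translator and does not emit the real execution control file syntax, which is not reproduced here; it is a hypothetical parser that only shows how each makefile rule already carries the three things a task description needs (output files, input files, commands). It handles explicit rules with tab-indented commands and ignores macros, implicit rules and line continuations.

```python
def parse_makefile(text):
    """Turn simple 'target: dependencies' rules into task records."""
    tasks = []
    current = None
    for line in text.splitlines():
        # Skip blank lines and comments.
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        if line.startswith("\t"):
            # A command line belongs to the most recent rule.
            if current is not None:
                current["commands"].append(line.strip())
        elif ":" in line:
            target, _, deps = line.partition(":")
            current = {
                "outputs": target.split(),
                "inputs": deps.split(),
                "commands": [],
            }
            tasks.append(current)
    return tasks
```

Each returned record could then be written out in whatever syntax the execution control file requires; the file names in any test input are placeholders.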

Since make was used to control generation of the EPIC executable, and since EPIC
is composed of many different modules, it was the simplest choice for a benchmark. The
data dependency graph for compiling and linking EPIC is in Appendix B.

Figure 4.5: EPIC analysis of Makefile simulation based on data dependency

The chart in Figure 4.5 shows the results of simulating the execution based on unit
task length and zero communications cost. The shape of the data dependency graph is far
more regular than that of either DRC rule set. Due to the relative absence of terminal
nodes, it was not always possible to "fill in the gaps" of processor idleness. Therefore, the
processors were not well utilized if there were more than seven of them, even though the
minimum (and maximum) extent of parallelism is 19.
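A simulation under the same two assumptions (unit task length, zero communications cost) can be sketched as follows. This is a simplified stand-in, not EPIC's scheduling algorithm: it picks ready tasks in arbitrary dictionary order, and the function name is hypothetical.

```python
def simulate(deps, n_processors):
    """Number of unit time steps needed to run a dependency graph.

    deps maps every task to the list of tasks it depends on. In each
    step, up to n_processors ready tasks run; communication is free.
    """
    if n_processors < 1:
        raise ValueError("need at least one processor")
    done = set()
    steps = 0
    while len(done) < len(deps):
        ready = [t for t in deps
                 if t not in done and all(d in done for d in deps[t])]
        if not ready:
            raise ValueError("dependency cycle")
        done.update(ready[:n_processors])
        steps += 1
    return steps
```

Running it for increasing processor counts shows the effect described above: once the count exceeds the number of tasks that are ready at the same time, extra processors sit idle and the step count stops shrinking.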

Figure 4.6: Results and analysis of make epic

Figure 4.6 shows the results of EPIC compilation tests run on a VAXcluster with
up to 3 VAX 11/780 computers and on DECnet with up to 4 VAX 11/780 computers. The
VAXclustered run showed a reasonable speedup with up to three processors, but more
tests will have to be run to see how well these results will scale.

The tests run with independent VAX computers indicate that the compilation of
EPIC is not sufficiently compute-bound to allow it to be efficiently distributed over an
Ethernet. As the chart shows, the file transfer overhead grew rapidly as the number of
processors increased. Running parallel make over DECnet may become profitable if the
data prefetch and delayed reporting extensions of Chapter 3 are applied to EPIC.

Chapter 5



Conclusion 



5.1 Summary 

In this thesis we have presented EPIC, the implementation of a software methodology
for coarse grained parallel processing. It is based on a computational model that is
applicable to a variety of different problems. We have described the characteristics that a
program must possess in order to be accelerated by EPIC. In addition, we have described
the adaptation of several existing applications to parallel computation using EPIC, with
varying degrees of success.

Parallel DRC was particularly successful. The tests run indicate a performance
increase that justifies the usage of the extra hardware. The base DRC program used in this
thesis was ECAD's DRACULA, but any design rule checker that uses intermediate files could
have been used. The strategy for running DRCs in parallel presented here is only one of
two promising approaches. We divided the DRC by allocating different rules in the design
rule set to each processor. Alternatively, the data partitioning scheme of [Bier, Pleszkun 1985] will
work with any design rule checker, and can be readily adapted to EPIC.

5.2 Directions For Future Research 

The results presented in this thesis did not fully test the claims made about the
extent of parallelism of either DRC or Makefiles. With more time and resources, it would
be interesting to try to execute a DRC as quickly as the critical path will allow. This would
also give the task scheduling algorithm a more substantial workout. The five and six
machine tests showed optimal performance from the task scheduler, but that was too easy.
To better support the claims made here about the scheduler's near-optimality in most real
data dependency graphs, we need to run more tests with more machines.

One way to further increase the extent of parallelism in VLSI design rule checking
is to combine rule partitioning with data partitioning. Essentially, once the chip is divided
into separate slices, several processors could be allocated to each slice, and each slice could
be checked by exploiting rule-based parallelism. The whole computation could be controlled
by EPIC using a single execution control file. Another strategy would be to use a
two-level hierarchy of star networks, with each master reporting to the grandmaster. The
single-master approach requires a bit of effort to prevent naming conflicts with intermediate
layer and error files, but offers the advantage of automatically load-balancing the
computation if any of the slices finish before any of the others.
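One way the single-master approach might avoid naming conflicts is to prefix every file with the name of its slice. The sketch below is hypothetical: the task record layout, the prefixing convention and the function name are assumptions made for illustration, and nothing like it was implemented or tested for this thesis.

```python
def combined_tasks(slices, rule_tasks):
    """Replicate every rule task once per slice, prefixing file names.

    rule_tasks is a list of dicts with 'name', 'inputs' and 'outputs'.
    Prefixing each file with its slice name keeps the intermediate
    layer and error files of different slices from colliding.
    """
    tasks = []
    for s in slices:
        for t in rule_tasks:
            tasks.append({
                "name": f"{s}_{t['name']}",
                "inputs": [f"{s}_{f}" for f in t["inputs"]],
                "outputs": [f"{s}_{f}" for f in t["outputs"]],
            })
    return tasks
```

Because all the resulting tasks land in one list, a single master can hand a task from any slice to whichever slave becomes free, which is where the load-balancing advantage comes from.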

5.2.1 Other Applications 

EPIC provides the basis for the acceleration through parallelism of a potentially
wide variety of existing software. Any computation controlled with Unix Makefiles can be
automatically converted to be run in parallel with EPIC. Another VLSI CAD application
that has the potential for acceleration via EPIC is mask pattern generation software.
In particular, ECAD's NDP 1 software uses the same rules file format and preprocessor as
DRACULA, so it may work with the existing ECAD2ECF preprocessor with only minor syntactic
additions. This was not explored further due to lack of time.

Using EPIC on VAXclusters, the data communications overhead becomes negligible,
and the set of programs that can be profitably accelerated grows
greatly. One application that comes to mind is merge-sorting. This classic binary
divide-and-conquer algorithm is ideal for EPIC. It would be fairly easy to adapt an existing
merge-sort program for use with EPIC. The constraining factor is the time required to
write the partial lists into disk files. But this overhead is also incurred in serial merge-sorts
if the list being sorted is too large to fit into physical memory.
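A merge-sort whose steps communicate only through disk files, as EPIC tasks do, can be sketched as follows. This is a single-level illustration with hypothetical function and file names, run here in one process; it assumes integer keys, one per line. The two `sort_chunk` calls are independent of each other, and the merge depends on both of their output files.

```python
import heapq
import os
import tempfile


def sort_chunk(values, path):
    """Leaf task: sort one partial list and write it to a disk file."""
    with open(path, "w") as f:
        f.writelines(f"{v}\n" for v in sorted(values))


def merge_files(left, right, out):
    """Interior task: merge two sorted files into a third."""
    with open(left) as a, open(right) as b, open(out, "w") as f:
        merged = heapq.merge(map(int, a), map(int, b))
        f.writelines(f"{v}\n" for v in merged)


def merge_sort_via_files(values, workdir):
    half = len(values) // 2
    left, right, out = (os.path.join(workdir, n)
                        for n in ("left.txt", "right.txt", "out.txt"))
    sort_chunk(values[:half], left)    # independent of the next task
    sort_chunk(values[half:], right)   # could run on another slave
    merge_files(left, right, out)      # needs both files above
    with open(out) as f:
        return [int(line) for line in f]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(merge_sort_via_files([5, 3, 9, 1, 4], d))
```

Applying the split recursively would give the binary tree of tasks described above, with the file writes at each level being the overhead the text mentions.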

5.2.2 Reducing the overhead 

The disk file overhead issue brings to light another issue. EPIC addresses a very
coarse parallelism. The control communications overhead forces us to apply the constraint
that a problem must be subdivided into tasks that each take a "long time" to execute.
But EPIC's model of parallelism doesn't require the loose coupling of the Ethernet
environment. A more tightly coupled multiprocessor would be able to accelerate a wider range
of applications. The concepts used in EPIC could be applied to a controller on such a
processor. It would be interesting to see how such a system might develop.

5.2.3 Lessons Learned about Distributed Programming 

In the past twenty years, there have been dramatic improvements in the quality of 
the tools used for programming. In particular, the recent advent of source line debugging
for high level programming languages on the VAX/VMS operating system has allowed the
programmer to more fully concentrate on the most interesting aspects of his task. Unfortu-
nately, this capability is often less accessible to those writing distributed or asynchronous 
programs. If a program is invoked by creating a process on a remote processor, how will the 
debugger interact with the terminal? It is possible to work around this problem by having 
the remote process allocate a terminal that is directly connected to the remote processor. 
That is not very helpful if there are many processors or if they are physically inaccessible. 
Much work needs to be done in the area of distributed programming environments. 

Similarly, software engineering has advanced considerably from the days of FORTRAN
and COBOL. The concepts of structured programming, data abstraction, object oriented
programming, data driven programming, and so on are well documented, publicized,
and lectured about in our undergraduate halls. In the course of implementing the message
passing facility of EPIC, less familiar methodologies had to be adopted to ensure
consistent data structures within a single processor, and to avoid deadlocks between two
communicating processors while guaranteeing message delivery. If parallel processors are
to become a popular hardware platform, we must learn how to program them as well as
we know how to program serial machines.

5.3 Conclusion 

Several factors affect how well the potential for acceleration of CAD tools through
parallelism will scale with time. As the complexity of VLSI circuits rises, the extent of
parallelism will rise due to geometric locality in the layouts, the constant overhead of EPIC's
control communication will become negligible, and the overhead of data communication
will most likely become less significant. Data communication will almost certainly not become
more significant as the complexity of the chips rises. This is based on the assumption
that CAD tools have time complexity ≥ O(n), where n represents the size of the input files,
since they must at least examine all their input. Hierarchical CAD tools are included in
this assumption, because the file representation is hierarchical as well. Empirically, the
time complexity for flat DRCs has been observed to be roughly O(n^1.[illegible]), with n being the
number of transistors [McGrath 1965].

Another factor that will determine how much extra speed we can squeeze out of
parallelism is the power of the processors on which we run the CAD tools. The VAX
8600 computer will run roughly four times as fast as the VAX 11/780 computer. Since
Ethernet technology is used as the control communications medium for both processors,
the control communications overhead on VAX 8600 computers may be as much as four
times as significant as the tests presented here indicate.

This statistic is best put into perspective by comparing it to the difference between 
the complexity of circuits being fabricated in 1977, when the VAX 11/780 computer was 
introduced, and the complexity of the circuits of 1985, when the VAX 8600 computer was 
introduced. While processor speed may have improved by a factor of four, VLSI circuit
complexity has increased by a factor of about twenty-five [Allen 1983].

Thus we predict that parallelism will continue to be a viable means for accelerating
layout verification of VLSI circuits in years to come. EPIC provides an inexpensive means
of substantially improving the throughput of existing software. As advances are made in
both processor speed and the exploitation of hierarchy in CAD tools, parallelism can still
be used to further reduce the execution time.

Appendix A



EPIC/DRACULA User's Manual



Parallel DRC is a method for running ECAD's VLSI design rule checker (DRACULA).
By dividing the run into separate portions to be run on several computers, Parallel
DRC reduces the amount of time required for a DRC run. A DRC using the standard
method of running on one computer may require several days to run on a large chip. This
time can be reduced to an overnight run using Parallel DRC.

This appendix describes the following aspects of Parallel DRC: 

• How Parallel DRC works 

• Potential Benefits from running Parallel DRC 

• Environment for running Parallel DRC 

• How to run Parallel DRC 

A.1 How Parallel DRC Works

The program used to run Parallel DRC is called EPIC (Exploiting Parallelism In
CAD). This program sets up processes on several computers to run portions of the DRC.
The computers are logically arranged in a star network. The central computer, called the
master, manages the work of all the other computers, called slaves. The entire design
rule check is broken into separate tasks, with each task roughly corresponding to a single
DRC rule. The master dynamically assigns tasks to the slaves, telling them what files are
needed to run the task. The slaves copy the files to their own directories and run the tasks.
The master keeps a record of the files each slave has. As each task is completed, the slave
notifies the master and sends the DRC output files to the master's directory. The master
then assigns another task to the slave. The execution proceeds in this manner until all the
tasks are completed. The last step is for the master to combine all the separate error files
(.ERR) into one file and append it to the summary (.SUM) file.
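The master's record of the files each slave has can be pictured with the following sketch. It is an illustration only, not EPIC's data structure: the class and method names are hypothetical, and it assumes that a slave keeps a local copy of the output files it produces.

```python
class FileRecord:
    """Master-side record of which files each slave already holds."""

    def __init__(self):
        self.held = {}  # slave name -> set of file names

    def files_to_send(self, slave, task_inputs):
        """Input files the slave must copy before it can run the task."""
        have = self.held.setdefault(slave, set())
        return [f for f in task_inputs if f not in have]

    def task_assigned(self, slave, task_inputs):
        """After assignment the slave holds all of the task's inputs."""
        self.held.setdefault(slave, set()).update(task_inputs)

    def task_completed(self, slave, task_outputs):
        """Assumption: the slave keeps a copy of the files it produced."""
        self.held.setdefault(slave, set()).update(task_outputs)
```

With such a record, a task whose inputs were produced or already fetched by the same slave needs no file transfer at all, which is why the record is worth keeping.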

The MONITOR program allows the user to initiate and control the parallel execution,
and provides a periodically updated display of the status of each process.

A.2 Potential Benefits From Running Parallel DRC

To evaluate whether or not you want to use EPIC to run ECAD DRC, you must
understand the basic principle behind it. DRC is not really a single program that must
be run from start to finish by a single computer. Rather, it consists of many separate programs that
are typically run one after another. Each program 'communicates' with the others simply
by reading and writing disk files.

EPIC provides a mechanism to distribute the execution of these programs over
several computers on a network. This distribution is very efficient in that almost no work
is duplicated by the extra computers. The only extra work incurred is the file transfers
needed to move the input and output files to the appropriate CPUs.

Preliminary tests of Parallel DRC have demonstrated a speedup of 4.5x using 6
computers to check a medium size chip. The speedup ratio will approach the number of
computers as the chip gets larger, since the time required to run the DRC rises faster than
the size of the data files.

The greatest practical advantage of Parallel DRC occurs with chips that take serial
DRCs several days to run on a loaded VAX computer. There, the DRC has
to fight for CPU time with interactive processes, slowing them down
while further delaying the completion of the DRC. Using EPIC, it will be possible to
complete the DRC overnight. That translates into a faster turnaround time for the layout
designers, and less aggravation for the other users of the computer facility.

A.3 Environment For Running Parallel DRC

A.3.1 Requirements For EPIC 

EPIC requires no special hardware configurations. It runs on any number of VAX
computers, each running VAX/VMS Version 4 or later, and all connected by DECnet. The
system runs in a heterogeneous environment of VAXclustered and unVAXclustered nodes.
Informing EPIC which nodes are VAXclustered results in increased performance, due to
the decrease in file transfer overhead. Running on MicroVAX computers is possible if there
is enough disk space to hold the ECAD software and the chip data.

EPIC requires that on each system, you have an account with the following
characteristics:

Proxy: *::USERNAME -> USERNAME

Privileges: NETMBX, TMPMBX, GRPNAM

Buffered I/O Byte Count Quota: 13000

Timer entry queue quota: 10

Open file quota: 100

Subprocess quota: 5

You should define a logical EPIC to point to the area where the EPIC programs
reside on your system. In addition, you need to set up two command files in your SYS$LOGIN
area: MASTER.COM and SLAVE.COM. You can copy examples of these files from the EPIC
distribution area.

You will want to run the parallel DRC using a different subdirectory for each slave. 
This is obvious for unVAXclustered computers, but even when two nodes share a file 
system, their slaves should be provided with separate subdirectories. This is due to a 
restriction in the ECAD DRACULA system that causes input files to be read-locked even 
if they will not be rewritten. This eliminates the possibility of file-sharing, even on a 
VAXcluster, because if a process tries to open a file that a parallel process has already 
locked, a fatal error will be signalled. VAXclusters are still helpful, provided the master is
running on the VAXcluster, since EPIC is smart enough to use local file transfers rather
than DECnet file transfers between VAXclustered nodes.

EPIC allows you to map your slaves to your processors any way you want. In
other words, you can have any number of slaves on each CPU. For Parallel DRC, the most
efficient strategy is to assign only one slave for each processor. You can run the master on
a processor that is already running a slave, since the master doesn't consume very much CPU
time.

A.3.2 ECAD DRACULA Requirements

You must have the ECAD system installed on each filesystem. VAXclusters only
need it installed once, rather than once for each CPU.

A.3.3 Input Requirements 

The input requirements are exactly the same as those for serial ECAD DRC. You
must have a layout file in some format understood by ECAD, and you must know the
primary cell name. You must also have a rules file (.DRC) describing the geometric
tolerances for the appropriate process technology. The rules file is used to generate control
files that allow EPIC to run the Parallel DRC.

A.4 Running A Parallel DRC 

A.4.1 Preprocessing Steps 

The EPIC kernel has no knowledge of DRC. It can run DRC only when provided
with a parameter file, called an execution control file (with extension .ECF). This file
can be generated directly from the DRC rules file using the program ECAD2ECF.EXE. This
program also generates a command file that contains the DCL code that directly drives
ECAD DRC. ECAD2ECF.EXE is easy to run, though it may take over an hour on a
well-loaded VAX 11/780 computer. The following is an example of its use. We assume that
CMOS.DRC is a rules file in the current default directory.

$ RUN/NODEBUG EPIC:ECAD2ECF
Ecad file name: CMOS.DRC
ECF file name: CMOS.ECF
COM file name: CMOS.COM

%DELETE-W-SEARCHFAIL, error searching for !AS
$

CMOS.ECF must then be placed in the master's subdirectory. It contains information
about each task needed to control the parallel execution. Specifically, for each task, it
indicates all of the input files, all of the output files, and all of the DCL commands needed
to generate those output files.

CMOS.COM must be placed in the SYS$LOGIN: area of each slave. We place it in
SYS$LOGIN rather than in the slave subdirectory so that we only have to store this rather
large file once per VAXcluster (see the discussion above about file sharing on VAXclusters).

Generally, the rules file for a given technology will remain fairly stable over
time. The only information that changes more often is the set of description parameters at the
top of the rules file. These might change with each run. We want to avoid running the
preprocessor as much as possible, since it is fairly time consuming. The best approach
is to run it once for each generation of the process technology, using generic description
parameters. Then, for each new set of description parameters, you must generate a new
.ECF and a new .COM file by doing the appropriate global string replacements in the generic
.ECF and .COM files. A program, FIXECAD.EXE, is provided for this purpose. It is fairly easy
to use, and doesn't take very much time (typically less than a minute). It prompts for the
old and new .ECF and .COM file names, and for the old and new description parameters.
Since the program does unintelligent global string replacements, you must choose your
generic description parameters so they will be unique. The appendix contains an example
of the use of FIXECAD that also demonstrates appropriate generic description parameters.
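The reason the generic parameters must be unique can be seen in a sketch of what an unintelligent global string replacement does. This is not FIXECAD itself; the function name and the sample strings in the usage note are hypothetical.

```python
def replace_parameters(text, replacements):
    """Apply literal global string replacements to the text of a file.

    replacements maps each generic description parameter to its new
    value. Nothing is parsed, so a generic value that also occurs
    elsewhere in the file is replaced there too.
    """
    for old, new in replacements.items():
        text = text.replace(old, new)
    return text
```

A distinctive generic value such as `GENERIC_CELL` is replaced only where it was intended. A short, common one such as `A` would also rewrite every other `A` in the file, including those inside keywords and file names.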

Sample .DRC, .ECF and .COM files for several technologies are provided in the EPIC
distribution. You may want to use these if they are sufficiently up-to-date. You will still
need to use FIXECAD to update the description parameters.

A.4.2 Running EPIC

All user interaction with the EPIC system is through the MONITOR.EXE program.
It is recommended that you run this program on the same processor as the master, though
it is not required. MONITOR uses the VAX/VMS Screen Management facility (SMG), so
you must run it from a DEC supported terminal such as a VT100 or a VT200 series
terminal. You can also run MONITOR in batch mode or from a command file. Normally
you will want to initiate the program interactively, since the network connections that will
be made occasionally fail on the first try due to timeouts or network flakiness. To save
typing, you have the option of initiating the start-up from a command file and continuing
or fixing any problems interactively.

To start MONITOR, use "$ RUN/NODEB EPIC:MONITOR". Your screen will then be
divided into three segments. The top third contains process monitoring information. Each
row in the display corresponds to a slave's subprocess, and is periodically updated to
display a variety of statistics including CPU time, elapsed time, the name of the current
program, and the number of tasks it has completed. The middle third is for error messages,
status messages, and other diagnostics. The bottom third is for your input.

The normal state of the program is that no prompt is offered. This is so that
the monitor can respond to any messages it receives from the master. There is no master
initially, so this may seem confusing. As soon as the user types something, monitor provides
a prompt in the bottom window and echoes what was typed thus far. While in this "read
line" mode, the monitor cannot react to messages from the master, so the normal state is
not to provide the prompt. If you type at monitor and it doesn't echo, that means it isn't
finished doing what you last told it to do. If you start to type something to monitor and
decide not to issue a command, just type CTRL/V [illegible] to get rid of the prompt.

Normally, the first thing to do is to create a master. Use the command

CREATE/MASTER/PROXY node comfile ecffile file-prefix cluster-list

If you do not type in the arguments, you will be prompted for them. The standard
DCL parser and line editor are used, so you will be able to use the arrow keys to edit
your input. Two special purpose keys are also assigned. PF1 terminates the current line
(executes it) and clears the bottom two thirds of the screen. PF2 terminates the current
line and repaints the entire screen.

The first argument, node, is the name of the node on which the master will be run.
Don't put the double colon (::) in here, just the name of the node. The second argument,
comfile, is usually MASTER, though you may have more than one version of this file that
does different things with default directories and renaming of NETSERVER.LOG. Don't bother
to specify the file extension, and don't include a device or directory specification; the file
must reside in SYS$LOGIN. The third argument, ecffile, is the name of the Execution
Control File (for example CMOS.ECF), only don't bother to include the extension when you
type it here. You can specify a device and directory, but you don't need to if it is the
same as file-prefix, the fourth argument. File-prefix is the master's subdirectory. It can
include a device and directory specification. The initial input file must be in this directory,
and all intermediate files and the error summary file will be placed there, so there must be
enough room on the disk. The last argument, cluster-list, is a list of machines that share
the same filesystem as the master's node. Include node in this list. This information
is used to optimize file transfers by using local COPYs rather than DECnet transfers when
appropriate.
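
The role of cluster-list can be pictured with a small sketch. This is only an illustration in Python, not EPIC's actual code: the function name `transfer_method`, the node names, and the two return labels are hypothetical, and it assumes that two nodes share a filesystem exactly when they appear together in one cluster list.

```python
def transfer_method(src_node, dst_node, clusters):
    """Pick a file transfer method for moving a file between two nodes.

    clusters is a list of node-name lists; nodes in the same list are
    assumed to share a filesystem (an assumption of this sketch).
    """
    src, dst = src_node.upper(), dst_node.upper()
    if src == dst:
        return "local-copy"
    for cluster in clusters:
        members = {node.upper() for node in cluster}
        if src in members and dst in members:
            return "local-copy"
    # Nodes are not known to share a filesystem: go over the network.
    return "decnet"


# Hypothetical node names, for illustration only.
clusters = [["VAXA", "VAXB", "VAXC"]]
print(transfer_method("VAXA", "VAXB", clusters))
print(transfer_method("VAXA", "VAXD", clusters))
```

Leaving a node out of its cluster list would not break anything in this sketch; it would only force the slower network path.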

The PROXY qualifier is used because in some future version of EPIC, we may
support password access.

After pressing carriage return, the MONITOR causes a process to be created on
node. This process executes comfile, which should run EPIC:MASTER.EXE, which will
acknowledge communication with monitor. It will then try to read in ecffile. You will be
told the outcome of this attempt, and that will be your cue to begin creating slaves.

CREATE/SLAVE/PROXY name node comfile file-prefix 

The only new parameter is the name parameter. This is used because more than
one slave per machine is supported by EPIC (though not recommended for DRACULA).
The name is used as a substring in file names, process names and in the group logical
name table. It should contain only alphanumerics, and be no more than eight characters
long. One would generally include the node name as part of this name when running on a
VAXcluster, so the log files will be identifiable.
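
The naming rule above is easy to check before a command is issued. The following Python sketch is an illustration only; the function name `is_valid_slave_name` is hypothetical, and the sketch additionally assumes that an empty name is not acceptable.

```python
def is_valid_slave_name(name):
    """Return True if name is 1 to 8 ASCII alphanumeric characters."""
    return 1 <= len(name) <= 8 and name.isascii() and name.isalnum()


# Hypothetical names: the first embeds a node name, as suggested above.
for candidate in ["VAXA1", "BAD-NAME", "TOOLONGNAME"]:
    print(candidate, is_valid_slave_name(candidate))
```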

CREATE/SLAVE is not really executed by the monitor. The text of the command
is sent to the master, and the master executes the command. This allows you to queue up
several CREATE/SLAVE commands without waiting for the command to finish. Diagnostic
messages will indicate the success or failure when the information becomes available.
Success will also be indicated by a new active line in the upper third (process monitoring
section) of the screen.

EPIC supports the use of more than one VAXcluster. The following command
tells EPIC about a VAXcluster other than the one specified in the CREATE/MASTER
command:

SET/CLUSTER=node1, node2, node3, node4 ...

The refresh cycle for process display is initially set to one minute. You can reset it
to (for example) five seconds with the SET/REFRESH command.


If for any reason you need to kill a slave, use the following command:

KILL/SLAVE slave-node slave-name

Again this command is not really executed by the monitor. The text of the command
is sent to the master, and it does the dirty work. The result should be evident from the
diagnostic message and process display. You can also do the dirty work yourself by stopping
the slave's process on its node. In any case, EPIC will reassign that slave's task to another
slave, and the computation will continue. If a slave fails due to a system crash, EPIC will
behave similarly. The computation will go on with the expected degraded performance.
You can also add a slave at any point in the computation with the CREATE/SLAVE
command.
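
The recovery behavior described above can be sketched as follows. This Python fragment is a simplified illustration rather than EPIC's implementation: the class `Scheduler` and its method names are hypothetical, data dependencies between tasks are ignored, and the choice to put a recovered task at the front of the ready queue is an assumption.

```python
from collections import deque


class Scheduler:
    """Toy master: hands ready tasks to idle slaves, recovers lost work."""

    def __init__(self, tasks):
        self.ready = deque(tasks)   # tasks waiting for a slave
        self.running = {}           # slave name -> task (None when idle)

    def dispatch(self):
        # Give each idle slave the next ready task, if any remain.
        for slave, task in self.running.items():
            if task is None and self.ready:
                self.running[slave] = self.ready.popleft()

    def add_slave(self, slave):
        # A slave may join at any point in the computation.
        self.running.setdefault(slave, None)
        self.dispatch()

    def slave_lost(self, slave):
        # Killed or crashed slave: its task goes back on the ready queue.
        task = self.running.pop(slave, None)
        if task is not None:
            self.ready.appendleft(task)
        self.dispatch()


s = Scheduler(["t1", "t2", "t3"])
s.add_slave("A")
s.add_slave("B")
s.slave_lost("A")      # "t1" returns to the ready queue
s.add_slave("C")       # the new slave picks up "t1"
print(s.running)
```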

You can kill the whole computation, including the master, with the KILL command.
This is a clean way to abort the computation. The log and summary files for the processes,
though not for the DRC, will be generated. You can also stop the master's process yourself,
and the slaves will terminate themselves soon thereafter.

You can use the monitor's EXIT command to get back to DCL. It is OK to do this
while a computation is running. To get back in touch with a master that you have left on
its own for a while, get back into the monitor, and use the command

MONITOR/PROXY

Performance will be much better if you do this while logged into master's-node.

A.4.3 Triggering The Parallel DRC

This is essentially automatic. As soon as the CREATE/MASTER completes, the 
master begins an initial step in the DRC in a subprocess. This is a task that must be 
completed before any of the slaves can be given any work. Normally, you will have created 
all the slaves before the MASTER finishes this step, but you can create slaves at any time, 
and they will be put to work if there is work to be done. 

For completeness, we mention that the subprocesses in which the actual DRC
is run do not inherit any process logical names or symbols you may have defined
in your LOGIN.COM. This should not affect an ECAD DRC, but if you create a file
SYS$LOGIN:EPICINIT.COM, it will be executed by each subprocess before it starts
running the DCL commands specified in the .ECF file.

A.4.4 Summary Files 

In addition to the DRC summary file that is created in the master's subdirectory,
EPIC leaves several other files in various places around your file system. Two summary
files will be created in SYS$LOGIN on the master's computer. EPICSTATUS.LOG will contain
a chart indicating the cpu time, the real time, and some other parameters for each slave.
EPICEXEC.PS is a Postscript file that can be printed on an Apple LaserWriter¹. It contains
a graphical representation of the parallel execution. The leftmost column indicates the
elapsed time at several points on the Y-axis. Each vertical column represents the activity
of a slave. Each diamond is the execution of a task or rule. The height of the diamond is
proportional to the amount of time it took to execute it. Each line segment between two
tasks represents a data dependency between those tasks, and roughly corresponds to a file
transfer. System .LOG files documenting the actual VAX/VMS programs run to execute
the DRC are generated in whatever directory was the default directory when EPIC:SLAVE
and EPIC:MASTER were initially run. MASTER.LOG and SLAVE.LOG are generated according
to the contents of MASTER.COM and SLAVE.COM. MASTER.LOG contains all the diagnostic
messages sent to the middle screen of the monitor.

¹ LaserWriter is a trademark of Apple Computer Corporation

A.5 Appendix

A.5.1 Sample Run Of EPIC:FIXECAD

Note: You don't have to specify anything for the old and new versions of a field if
you don't want to change that field. Every time a substitution is made, the old line and
the new line are printed out. Much of this was edited out of the example below.

$ run epic:fixecad

Old COM cmos
Old ECF cmos
New COM field
New ECF field

Old Indisk: infile.gds
New Indisk: field.gds
Old Outdisk: outfile.err
New Outdisk: outfield.err
Old Print: summary
New Print: summary
Old Primary: maincell
New Primary: field
Old System: gds2
New System: gds2
Old Dir: segcad$ecad:
New Dir: segcad$ecad:

1 TREEMAIN
1 TREEFIEL

$ASSIGN INFILE.GDS FOR009
$ASSIGN FIELD.GDS FOR009

TREEMAIN
TREEFIEL

1000 1 MAINCELL
1000 1 FIELD

$ASSIGN OUTFILE.ERR FOR000
$ASSIGN OUTFIELD.ERR FOR000

TREEMAIN OUTMAINCELL
TREEFIEL OUTFIELD

,TREEMAIN.DAT-
,TREEFIEL.DAT-

/DCL=("$@SYS$LOGIN:CMOS.COM 1"-
/DCL=("$@SYS$LOGIN:FIELD.COM 1"-

A.5.2 Execution Control File

The following is an example of one task in the ECF file created above.

task "NOT TOTNWL MASKLR NWELL"-
/INPUT=(TOTNWL.DAT-
,MASKLR.DAT-
)-
/OUTPUT=(NWELL.DAT-
)-
/DCL=("$@SYS$LOGIN:FIELD.COM 16"-
)
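
To make the structure of such a task concrete, the sketch below models it as a small Python record. It is an illustration only: the class `Task` and the function `runnable` are hypothetical names, the field values are transcribed from the example above as well as the scan allows, and the rule that a task becomes eligible once all of its /INPUT files exist is an assumption about how the dependencies are used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str        # the quoted task name
    inputs: tuple    # files listed under /INPUT
    outputs: tuple   # files listed under /OUTPUT
    dcl: tuple       # DCL command lines listed under /DCL


def runnable(tasks, available_files):
    """Names of tasks whose input files are all available."""
    available = set(available_files)
    return [t.name for t in tasks if set(t.inputs) <= available]


nwell = Task(
    name="NOT TOTNWL MASKLR NWELL",
    inputs=("TOTNWL.DAT", "MASKLR.DAT"),
    outputs=("NWELL.DAT",),
    dcl=("$@SYS$LOGIN:FIELD.COM 16",),
)

print(runnable([nwell], ["TOTNWL.DAT"]))
print(runnable([nwell], ["TOTNWL.DAT", "MASKLR.DAT"]))
```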

A.5.3 Command File 

Each page of the .COM file corresponds to an .ECF task, such as the one above. At
the beginning of the .COM file, there is a $ GOTO 'P1', which explains how the correct step
gets executed.

$ 16:
$ ! NOT TOTNWL MASKLR NWELL
$ !
$SET PROCESS/NAME=16GDSIN
$RUN SEGCAD$ECAD:LOGICAL
3 TOTNWL MASKLR NWELL 1000 MIC
$IF .NOT. $STATUS THEN GOTO LQUIT
$OUTPUT:
$IF P2 .EQS. "OUTPUT" THEN GOTO LQUIT
$EXIT
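
The dispatch idea, one command file holding every step with the first parameter selecting which one to run, can be mimicked in a few lines of Python. This is an analogy only, not a translation of the command file: the function `run_step` and the step bodies are hypothetical.

```python
def run_step(steps, p1):
    """Run the step whose label equals the first parameter."""
    if p1 not in steps:
        raise KeyError(f"no step labelled {p1!r}")
    return steps[p1]()


# Hypothetical steps keyed by label; a task would pass "16" as P1.
steps = {
    "15": lambda: "ran step 15",
    "16": lambda: "ran step 16",
}

print(run_step(steps, "16"))
```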
Appendix B

Data Dependency Graphs

This appendix contains printed representations of the data dependency graphs used
in the testing of EPIC. Included are examples for DEC CMOS design rules, MOSIS CMOS
design rules, and the compilation and linking of EPIC.

[Figure: DEC CMOS DRC Data Dependency Graph]

[Figure: MOSIS CMOS DRC Data Dependency Graph]

[Figure: "make EPIC" Data Dependency Graph]

Appendix C

Data from the testing of EPIC

This appendix contains raw statistics generated by EPIC for the test runs with a
varying number of processors. Each section consists of all the data for a single application.
Each subsection has a table of statistics and a graphical log for a single run. The leftmost
column of the graphical log indicates the elapsed time at several points on the Y-axis. Each
vertical column represents the activity of a single slave. Each diamond is the execution
of a task. The height of the diamond is proportional to the amount of time it took to
execute the corresponding task. Each line segment between two tasks represents a data
dependency between those tasks, and roughly corresponds to a file transfer.
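
As a reading aid for the raw numbers, the Python sketch below shows one way to turn ELAPSED values into speedup and efficiency figures. It is an illustration, not part of EPIC: the helper name `to_seconds` is hypothetical, the three ELAPSED values are taken as printed in the MASTER statistics of sections C.1.3, C.1.4 and C.1.5, and using the one-processor EPIC run as the baseline is a choice made here.

```python
def to_seconds(hms):
    """Convert an 'HH:MM:SS.ss' string to seconds."""
    hours, minutes, seconds = hms.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


# ELAPSED values for 1, 2 and 3 VAXclustered processors (DECCMOS.ECF).
elapsed = {1: "05:21:28.13", 2: "02:51:09.61", 3: "01:57:56.11"}

baseline = to_seconds(elapsed[1])
speedup = {n: baseline / to_seconds(t) for n, t in elapsed.items()}
efficiency = {n: speedup[n] / n for n in elapsed}

for n in sorted(elapsed):
    print(f"{n} processor(s): speedup {speedup[n]:.2f}, "
          f"efficiency {efficiency[n]:.2f}")
```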
C.1 DRACULA with DEC CMOS rules

C.1.1 Serial DRACULA on a VAX 11/780 computer

Buffered I/O count:  8830           Peak working set size:  8060
Direct I/O count:    66129          Peak page file size:    19635
Page faults:         239201         Mounted volumes:
Charged CPU time:    04:00:11.84    Elapsed time:           06:06:60.41

C.1.2 Parallel DRACULA on three VAXclustered VAX 11/780 computers

9-APR-1986 07:22:43.28

Accounting information (for the "MASTER" process):
Buffered I/O count:  6728           Peak working set size:  8000
Direct I/O count:    12629          Peak virtual size:      18898
Page faults:         80986          Mounted volumes:
Images activated:    644
Elapsed CPU time:    00:23:23.61
Connect time:        02:62:61.39

Total "SLAVE" statistics:
Elapsed seconds: 32717
CPU seconds: 14210

C.1.3 EPIC using one VAX 11/780 computer

MASTER Statistics for EPIC run using ECF file DECCMOS.ECF
EPIC Version V1.0
29-MAR-1986 18:45:38.05    BUFIO:  3549
ELAPSED: 05:21:28.13       DIRIO:  487
CPU: 0:01:13.19            FAULTS: 720

Subprocess statistics (all times in seconds)

[Table illegible in the scanned source]

[Graphical log: DECCMOS.ECF run on 29-MAR-1986 18:45:45; time axis from 0 to 5 hours]

C.1.4 EPIC using two VAXclustered VAX 11/780 computers

MASTER Statistics for EPIC run using ECF file DECCMOS.ECF
EPIC Version V1.0
28-MAR-1986 07:21:18.95    BUFIO:  3170
ELAPSED: 02:51:09.61       DIRIO:  858
CPU: 0:01:07.09            FAULTS: 702

Subprocess statistics (all times in seconds)

[Graphical log: DECCMOS.ECF run on 28-MAR-1986 07:21:27; time axis from 0 to 150 minutes]

C.1.5 EPIC using three VAXclustered VAX 11/780 computers

MASTER Statistics for EPIC run using ECF file DECCMOS.ECF
EPIC Version V1.0
27-MAR-1986 06:28:09.75    BUFIO:  3179
ELAPSED: 01:57:56.11       DIRIO:  899
CPU: 0:01:06.27            FAULTS: 709

Subprocess statistics (all times in seconds)

[Table largely illegible in the scanned source. Legible column headings include Node, CPU,
File Time, Exec CPU, Peak, and Page Faults; legible Page Faults entries are 2374, 319370,
138045, and 197093.]

[Graphical log: DECCMOS.ECF run on 27-MAR-1986 06:28:18; time axis from 0 to 109 minutes]

C.1.6 EPIC using two independent VAX 11/780 computers

MASTER Statistics for EPIC run using ECF file DECCMOS.ECF
EPIC Version V1.0
9-APR-1986 03:31:27.82     BUFIO:  3434
ELAPSED: 03:15:42.65       DIRIO:  508
CPU: 0:01:23.19            FAULTS: 830

Subprocess statistics (all times in seconds)

[Table illegible in the scanned source]

[Graphical log: DECCMOS.ECF run on 9-APR-1986 03:31:36; time axis from 0 to 188 minutes]

C.1.7 EPIC using three independent VAX 11/780 computers

MASTER Statistics for EPIC run using ECF file DECCMOS.ECF
EPIC Version V1.0
7-APR-1986 06:43:13.96     BUFIO:  3478
ELAPSED: 02:13:40.10       DIRIO:  502
CPU: 0:01:18.17            FAULTS: 827

Subprocess statistics (all times in seconds)

[Graphical log: DECCMOS.ECF run on 7-APR-1986 06:43:21; time axis from 0 to 126 minutes]

C.1.8 EPIC using four independent VAX 11/780 computers

MASTER Statistics for EPIC run using ECF file DECCMOS.ECF
EPIC Version V1.0
6-APR-1986 02:12:26.93     BUFIO:  3379
ELAPSED: 01:36:10.71       DIRIO:  398
CPU: 0:01:20.47            FAULTS: 837

Subprocess statistics (all times in seconds)

[Table illegible in the scanned source]

[Graphical log: DECCMOS.ECF run on 6-APR-1986 02:12:44; time axis from 0 to 88 minutes]

C.1.9 EPIC using five independent VAX 11/780 computers

MASTER Statistics for EPIC run using ECF file DECCMOS.ECF
EPIC Version V1.0
12-APR-1986 06:22:44.11    BUFIO:  3606
ELAPSED: 01:26:68.26       DIRIO:  418
CPU: 0:01:10.61            FAULTS: 866

Subprocess statistics (all times in seconds)

[Table illegible in the scanned source]

[Graphical log: DECCMOS.ECF run on 12-APR-1986 06:22:51; time axis from 0 to 78 minutes]

C.1.10 EPIC using six independent VAX 11/780 computers

MASTER Statistics for EPIC run using ECF file DECCMOS.ECF
EPIC Version V1.0
10-APR-1986 06:12:36.06    BUFIO:  3529
ELAPSED: 01:14:21.67       DIRIO:  406
CPU: 0:01:14.59            FAULTS: 879

Subprocess statistics (all times in seconds)

[Graphical log: DECCMOS.ECF run on 10-APR-1986 06:12:43; time axis from 0 to 67 minutes]

C.2 Compiling and Linking EPIC

The following statistics were generated by VMS after compiling and linking EPIC and
its preprocessors.

Accounting information:
Buffered I/O count:  962            Peak working set size:  3886
Direct I/O count:    2599           Peak virtual size:      7904
Page faults:         31011          Mounted volumes:
Images activated:    26
Elapsed CPU time:    00:06:48.74
Connect time:        00:10:19.12

C.2.1 EPIC using one VAX 11/780 computer

MASTER Statistics for EPIC run using ECF file MAKEEPIC.ECF
EPIC Version V1.0
30-MAR-1986 13:50:40.42    BUFIO:  732
ELAPSED: 00:10:90.00       DIRIO:  81
CPU: 0:00:14.42            FAULTS: 231

Subprocess statistics (all times in seconds)

[Graphical log: MAKEEPIC.ECF run on 30-MAR-1986 13:50:48.10; time axis from 0 to 10 minutes]

C.2.2 EPIC using two VAXclustered VAX 11/780 computers

MASTER Statistics for EPIC run using ECF file MAKEEPIC.ECF
EPIC Version V1.0
1-APR-1986 02:56:50.84     BUFIO:  923
ELAPSED: 00:05:39.54       DIRIO:  101
CPU: 0:00:16.24            FAULTS: 248

Subprocess statistics (all times in seconds)

[Graphical log: MAKEEPIC.ECF run on 1-APR-1986 02:56:17; time axis from 0 to 309 seconds]

C.2.3 EPIC using three VAXclustered VAX 11/780 computers

MASTER Statistics for EPIC run using ECF file MAKEEPIC.ECF
EPIC Version V1.0
6-APR-1986 23:27:38.06     BUFIO:  026
ELAPSED: 00:04:12.70       DIRIO:  166
CPU: 0:00:16.29            FAULTS: 269

Subprocess statistics (all times in seconds)

[Graphical log: MAKEEPIC.ECF run on 6-APR-1986 23:27:45; time axis from 0 to 213 seconds]

C.2.4 EPIC using two independent VAX 11/780 computers

MASTER Statistics for EPIC run using ECF file MAKEEPIC.ECF
EPIC Version V1.0
3-APR-1986 02:27:38.02     BUFIO:  1028
ELAPSED: 00:07:57.36       DIRIO:  51
CPU: 0:00:18.33            FAULTS: 286

Subprocess statistics (all times in seconds)

[Graphical log: MAKEEPIC.ECF run on 3-APR-1986 02:27:46; time axis from 0 to 447 seconds]

C.2.5 EPIC using three independent VAX 11/780 computers

MASTER Statistics for EPIC run using ECF file MAKEEPIC.ECF
EPIC Version V1.0
3-APR-1986 02:06:48.90     BUFIO:  1067
ELAPSED: 00:06:32.75       DIRIO:  64
CPU: 0:00:18.26            FAULTS: 204

Subprocess statistics (all times in seconds)

[Graphical log: MAKEEPIC.ECF run on 3-APR-1986 02:06:57; time axis from 0 to 350 seconds]

C.2.6 EPIC using four independent VAX 11/780 computers

MASTER Statistics for EPIC run using ECF file MAKEEPIC.ECF
EPIC Version V1.0
3-APR-1986 02:16:04.72     BUFIO:  1086
ELAPSED: 00:06:32.38       DIRIO:  44
CPU: 0:00:17.95            FAULTS: 311

Subprocess statistics (all times in seconds)

[Graphical log: MAKEEPIC.ECF run on 3-APR-1986 02:16:12; time axis from 0 to 90 seconds]

Appendix D

EPIC Messages

This Appendix contains all the messages sent as control communication. They
effectively define the architecture of the software behind EPIC.

D.1 Messages sent from user to monitor

EXIT
    Terminate the MONITOR program. This does not affect the operation of the
    master.

CREATE/MASTER node com-file ecf-file working-directory cluster-list
    Create a master

CREATE/SLAVE name node com-file working-directory
    Tell the master to create a slave and put it in the database

MONITOR master's-node
    Establish communication with an already-existing master

KILL
    Tell the master to terminate the computation and generate the log files

KILL/SLAVE slave-node slave-name
    Tell the master to terminate the slave and insert its task (if any) into the ready
    queue

SET/CLUSTER = (node1, node2 ...)
    Tell the master to define a set of nodes to be clustered together

SET/REFRESH = time interval
    Tell the master to set the interval at which the process rate is refreshed

D.2 Messages sent from monitor to master

SET/CLUSTER = (node1, node2 ...)
    Define a set of nodes to be clustered together

SET/REFRESH = time interval
    Set the interval at which the process rate is refreshed

EXIT
    Terminate the computation and generate the log files

KILL/NAME = slave's name /NODE = slave's node
    Terminate the slave and insert its task (if any) into the ready queue

CREATE/SLAVE name node command-file working-directory
    Create a slave and put it in the database
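
Comparing D.1 with D.2 suggests how the monitor relays user commands to the master. The Python sketch below is one reading of that correspondence, not EPIC's actual code: the function name `to_master_message` is hypothetical, the pairing of the user's KILL with the master's EXIT is inferred from their matching descriptions, and the exact spacing of the KILL/NAME message is an assumption.

```python
# Commands the monitor is assumed to pass through unchanged.
FORWARDED = ("CREATE/SLAVE", "SET/CLUSTER", "SET/REFRESH")


def to_master_message(user_command):
    """Translate a user command into the message sent to the master.

    Returns None for commands the monitor handles by itself
    (EXIT, CREATE/MASTER, MONITOR).
    """
    text = user_command.strip()
    if not text:
        return None
    words = text.split()
    verb = words[0].upper()
    if verb == "KILL":
        return "EXIT"
    if verb == "KILL/SLAVE":
        if len(words) != 3:
            raise ValueError("expected: KILL/SLAVE slave-node slave-name")
        node, name = words[1], words[2]
        return f"KILL/NAME={name} /NODE={node}"
    if verb.startswith(FORWARDED):
        return text
    return None


# Hypothetical node and slave names.
print(to_master_message("KILL/SLAVE VAXA SLAVE1"))
print(to_master_message("KILL"))
```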

D.3 Messages sent from master to monitor

MESSAGE msg
    Allows the master to put an arbitrary message on the monitor's screen

STATUS line-number contents
    Send statistics line describing slave's subprocess' CPU usage to the monitor's
    process display

DONE
    Indicates to the monitor that the whole computation has completed.

D.4 Messages sent from master to slave

START task-name /INPUT=(in1, in2 ...) /OUTPUT=(out1, out2 ...) /DCL=(dcl1, dcl2 ...)
    Start the task with the specified inputs, outputs and DCL commands

EXIT
    Terminate the slave subprocess and exit

FREE
    Charge elapsed time to the FREE counter, rather than the IDLE counter

SET/REFRESH = time interval
    Set the interval at which the slave sends process line information
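
The START message is the richest of these, so a small formatter and parser may help in reading it. This Python sketch is an illustration under stated assumptions, not EPIC's wire format: the function names are hypothetical, items are assumed to be separated by a comma and a space, and the parser does not handle file names or DCL commands that themselves contain commas or parentheses.

```python
import re

START_PATTERN = re.compile(
    r"START (.+?) /INPUT=\((.*?)\) /OUTPUT=\((.*?)\) /DCL=\((.*)\)"
)


def format_start(task, inputs, outputs, dcl):
    """Build a START message from a task name and three item lists."""
    def group(items):
        return "(" + ", ".join(items) + ")"
    return (f"START {task} /INPUT={group(inputs)} "
            f"/OUTPUT={group(outputs)} /DCL={group(dcl)}")


def parse_start(message):
    """Split a START message back into (task, inputs, outputs, dcl)."""
    match = START_PATTERN.fullmatch(message)
    if match is None:
        raise ValueError("not a START message")

    def items(text):
        return [part.strip() for part in text.split(",") if part.strip()]

    return (match.group(1), items(match.group(2)),
            items(match.group(3)), items(match.group(4)))


# Hypothetical task name; file names follow the ECF example in A.5.2.
message = format_start("NWELL", ["TOTNWL.DAT", "MASKLR.DAT"],
                       ["NWELL.DAT"], ["@SYS$LOGIN:FIELD.COM 16"])
print(message)
print(parse_start(message))
```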

D.5 Messages sent from slave to master

COMPLETED
    The slave completed its task

FAILED reason
    The slave failed its task

STARTED
    The slave has retrieved the input files and started the task

MESSAGE msg
    Allows the slave to put an arbitrary textual message into the master's log file

STATUS status line
    Send statistics line describing slave's subprocess' CPU usage to the master for the
    monitor's process display

FINAL statistics
    Send final statistics about the slave's subprocess' CPU usage, etc., to the master.